Methods of encoding and combining integer lists in a computer system, and computer software product for implementing such methods

ABSTRACT

A range covering integers of an input list is divided into subsets according to a predetermined pattern. The encoding produces coding data including, for each subset containing at least one integer of the input list, data representing the position of this subset in the pattern, and data representing the position of each integer of the input list within this subset. This encoding process may be iterated in several coding layers. It supports very efficient methods for combining the coded integer lists.

BACKGROUND OF THE INVENTION

[0001] The present invention relates to methods of handling integer lists in computer systems.

[0002] In a non-limiting manner, the invention is applicable in the field of relational database management systems (RDBMS), where the integer lists may represent identifiers of records in various tables.

[0003] It is well known that, in computer systems, integer lists may equivalently be stored and handled in the explicit form of integer lists or in the form of bitmap vectors. A bitmap vector has binary components each indicating whether an integer corresponding to the rank of the component belongs (1) or does not belong (0) to the list. The dimension of the vector has to be at least equal to the largest integer of the list.

[0004] The bitmap representation is convenient because a variety of manipulations can be performed on the coded lists by subjecting the binary components of the vectors to Boolean operations, which are the most basic operations in the usual processors. For example, integer lists are readily intersected by means of the Boolean AND operation, merged by means of the Boolean OR operation, complemented by means of the Boolean NOT operation, etc.
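The correspondence between list operations and Boolean operations can be sketched as follows. This is only an illustration: it assumes that a Python integer is used as the bitmap vector, and the helper names, the dimension of 12 and the two sample lists are hypothetical.

```python
def to_bitmap(integers):
    """Build a bitmap in which bit i is set when integer i belongs to the list."""
    bitmap = 0
    for i in integers:
        bitmap |= 1 << i
    return bitmap

def to_list(bitmap):
    """Return, in increasing order, the integers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

DIMENSION = 12                   # illustrative vector dimension (assumption)
FULL = (1 << DIMENSION) - 1      # mask that limits NOT to the vector dimension

a = to_bitmap([0, 3, 5, 6])
b = to_bitmap([3, 4, 6, 9])

intersection = to_list(a & b)    # Boolean AND intersects the lists
union = to_list(a | b)           # Boolean OR merges the lists
complement = to_list(~a & FULL)  # Boolean NOT complements the list
```

With these sample lists, the AND yields [3, 6] and the OR yields [0, 3, 4, 5, 6, 9].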

[0005] When the integers of the lists are potentially big, the dimension of the bitmap vectors becomes large, so that the memory space required to store the lists in that form becomes a problem. When the lists are sparsely filled with integers of the big range, the explicit integer format is much more compact: a list of K integers in the range [0, 2³²[ requires K×32 bits vs. 2³²≈4.3 billion bits in the bitmap format.
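The arithmetic behind this comparison can be checked with a few lines. The function names are hypothetical; the only figures used are the 32-bit range and the K×32 cost stated above.

```python
RANGE_BITS = 32  # integers are taken in the range [0, 2**32[

def explicit_size_bits(k):
    """Size of a list of k integers stored explicitly, 32 bits each."""
    return k * RANGE_BITS

def bitmap_size_bits():
    """Size of an uncompressed bitmap vector covering the whole range."""
    return 2 ** RANGE_BITS

# Number of integers at which both formats take the same space.
break_even = bitmap_size_bits() // RANGE_BITS
```

Under these assumptions the explicit format is the smaller one as long as the list holds fewer than 2²⁷ (about 134 million) integers.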

[0006] Bitmap compression methods have been proposed to overcome this limitation of the bitmap representation. These methods consist in locating regions of the vectors whose components have a constant value, so as to encode only the boundaries of those regions. The remaining regions can be coded as bitmap segments. An appreciable gain is achieved when very large constant regions are found. Examples of such bitmap compression methods are disclosed in U.S. Pat. Nos. 5,363,098 and 5,907,297.

[0007] This type of bitmap compression optimizes the storage of the encoded integer lists, but not their handling. Multiple comparisons are required to detect overlapping bitmap segments when performing basic Boolean operations on the bitmaps (see U.S. Pat. No. 6,141,656). This is not computationally efficient. In addition, when the coding data of the constant regions and bitmap segments are stored in memory devices such as hard drives (i.e. not in RAM), numerous disc read operations are normally required, which is detrimental to the processing speed.

[0008] An object of the present invention is to propose alternative methods of encoding and/or combining integer lists, whereby lists of potentially large dimension can be efficiently handled.

SUMMARY OF THE INVENTION

[0009] The invention proposes a method of encoding integer lists in a computer system, comprising the steps of:

[0010] dividing a range covering integers of an input list into subsets according to a predetermined pattern; and

[0011] producing coding data including, for each subset containing at least one integer of the input list, data representing the position of said subset in the pattern, and data representing the position of each integer of the input list within said subset.
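As a non-authoritative illustration of the two steps above, the sketch below assumes that the predetermined pattern is a partition of the range into consecutive intervals of `subset_size` integers. The position of a subset is then the quotient of the integer by `subset_size`, and the position within the subset is the remainder. The function names, the dictionary layout and the sample list are assumptions made for the example.

```python
def encode_layer(integers, subset_size):
    """Map the rank of each non-empty subset to a bitmap segment.

    Bit p of the segment is set when the integer rank * subset_size + p
    belongs to the input list.
    """
    coding = {}
    for x in integers:
        rank, position = divmod(x, subset_size)
        coding[rank] = coding.get(rank, 0) | (1 << position)
    return coding

def decode_layer(coding, subset_size):
    """Rebuild the sorted integer list from the coding data."""
    result = []
    for rank in sorted(coding):
        segment = coding[rank]
        for position in range(subset_size):
            if (segment >> position) & 1:
                result.append(rank * subset_size + position)
    return result

coding = encode_layer([3, 5, 130, 1000], subset_size=8)
decoded = decode_layer(coding, subset_size=8)
```

Only the three subsets that contain an integer of the list (ranks 0, 16 and 125) produce coding data; the empty subsets cost nothing.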

[0012] The invention further proposes a method of encoding integer lists in a computer system, comprising n successive coding layers, n being a number at least equal to 1. In each coding layer, the method comprises the steps of:

[0013] dividing a range covering integers of an input list of said layer into subsets according to a predetermined pattern;

[0014] producing coding data including, for each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last coding layer, data representing the position of said subset in the pattern;

[0015] if said layer is not the last coding layer, forming a further integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list, and providing said further integer list as an input list of the next layer.
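The iteration over n layers may be sketched as below, again assuming patterns made of consecutive equal-size intervals. For simplicity this sketch keeps the subset ranks as dictionary keys in every layer, which is a convenience of the example and not a storage layout taken from the text. All names and sample values are hypothetical.

```python
def encode_layers(integers, subset_sizes):
    """Encode a list through len(subset_sizes) successive layers.

    subset_sizes[k - 1] is the assumed interval size of layer k.
    Returns (layers, top_ranks): layers[k - 1] maps each non-empty
    layer k subset rank to its bitmap segment, and top_ranks lists
    the subset ranks of the last layer.
    """
    layers = []
    current = sorted(set(integers))      # input list of layer 1
    for size in subset_sizes:
        segments = {}
        for x in current:
            rank, position = divmod(x, size)
            segments[rank] = segments.get(rank, 0) | (1 << position)
        layers.append(segments)
        current = sorted(segments)       # ranks form the next input list
    return layers, current

layers, top_ranks = encode_layers([3, 5, 130, 1000], subset_sizes=[8, 8])
```

Here the layer 1 ranks [0, 16, 125] become the input list of layer 2, whose own non-empty subsets have ranks [0, 2, 15].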

[0016] Another aspect of the invention relates to a computerized method of combining a plurality of first integer lists into a second integer list, wherein at least one of the first integer lists is represented by stored coding data provided by a coding scheme comprising n successive coding layers as outlined above. The combining method comprises the steps of:

[0017] defining a combination of intermediary lists each corresponding to at least one of the first integer lists;

[0018] for k decreasing from n to 1, computing a layer k result list by combining a plurality of layer k intermediary lists in accordance with said combination; and

[0019] producing the second integer list as the layer 1 result list.

[0020] For any intermediary list corresponding to at least one first integer list represented by stored coding data, the layer n intermediary list is determined from said stored coding data as consisting of the integers of any layer n input list associated with said at least one first integer list in the coding scheme. If n>1, each layer k intermediary list for k<n is determined from said stored coding data and the layer k+1 result list as consisting of any integer of a layer k input list associated with said at least one first integer list in the coding scheme which belongs to a layer k subset whose position is represented in the layer k+1 result list.
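A minimal sketch of this top-down combination is given below for one particular combination, the intersection of two coded lists. It assumes equal-size interval patterns and a dictionary-based coding in which every layer is keyed by subset rank; the function names and sample lists are hypothetical, and other combinations (merge, complement) are not covered.

```python
def encode_layers(integers, subset_sizes):
    """Return one {subset rank: bitmap segment} dictionary per layer."""
    layers, current = [], sorted(set(integers))
    for size in subset_sizes:
        segments = {}
        for x in current:
            rank, position = divmod(x, size)
            segments[rank] = segments.get(rank, 0) | (1 << position)
        layers.append(segments)
        current = sorted(segments)
    return layers

def expand(segments, size, ranks):
    """Integers of a layer input list lying in the subsets listed in ranks."""
    found = set()
    for rank in ranks:
        segment = segments.get(rank, 0)
        for position in range(size):
            if (segment >> position) & 1:
                found.add(rank * size + position)
    return found

def intersect_coded(coded_lists, subset_sizes):
    """Intersect coded lists, for k decreasing from n to 1."""
    n = len(subset_sizes)
    result = set()
    for k in range(n, 0, -1):
        size = subset_sizes[k - 1]
        intermediaries = []
        for layers in coded_lists:
            segments = layers[k - 1]
            # Layer n uses the whole input list; lower layers only visit
            # the subsets whose rank survived in the layer k + 1 result.
            ranks = list(segments) if k == n else result
            intermediaries.append(expand(segments, size, ranks))
        result = set.intersection(*intermediaries)
    return sorted(result)

sizes = [8, 8]
coded_a = encode_layers([3, 5, 130, 1000], sizes)
coded_b = encode_layers([5, 131, 1000, 2000], sizes)
common = intersect_coded([coded_a, coded_b], sizes)
```

In this example the layer 2 result is {0, 16, 125}, so the layer 1 segment of rank 250 (which holds 2000 in the second list) is never read. Skipping such subsets is the saving that the layered scheme is meant to provide.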

[0021] Another aspect of the invention relates to computer program products having instructions for carrying out methods as outlined above.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] FIGS. 1-3 show an example of data structure as typically used in a conventional relational database system.

[0023] FIG. 4 is a diagram representing a data table tree in the example of FIGS. 1-3.

[0024] FIGS. 5-7 are diagrams showing respective data graphs constructed with the tree of FIG. 4 and the data of FIGS. 1-3.

[0025] FIG. 8 is a flat file representation of the data tables of FIGS. 1-3.

[0026] FIG. 9 shows a link table as used in an embodiment of the invention.

[0027] FIGS. 10A-H show the contents of thesauruses corresponding to the data tables of FIGS. 1-3.

[0028] FIGS. 11A-14A, 11G-14G and 11H-14H show other representations of the thesauruses of FIGS. 10A, 10G and 10H, respectively.

[0029] FIGS. 15-16 illustrate the data stored in a data container in connection with the thesauruses of FIGS. 14A, 14G and 14H.

[0030] FIG. 17 shows another possible structure of the thesaurus of FIGS. 10A-14A.

[0031] FIG. 18 is a block diagram of a computer system suitable for implementing the invention.

[0032] FIG. 19 is a flow chart showing a data graph creation procedure in accordance with an embodiment of the invention.

[0033] FIG. 20 is a flow chart showing a procedure applicable in stage 124 of FIG. 19.

[0034] FIGS. 21 and 22 are flow charts showing procedures applicable in step 136 of FIG. 20.

[0035] FIGS. 23 and 24 are flow charts showing another procedure applicable in step 136 of FIG. 20 in two successive coding layers.

[0036] FIGS. 25-32 are tables showing a way of storing thesauruses constructed from the example of FIGS. 1-3.

[0037] FIG. 33 is a flow chart showing an alternative way of executing steps 135 and 136 of FIG. 20 when the thesauruses are stored as shown in FIG. 17.

[0038] FIGS. 34A and 34B are tables showing an alternative embodiment of the tables of FIGS. 31-32.

[0039] FIGS. 34C and 34D are another representation of the tables of FIGS. 34A and 34B.

[0040] FIG. 35 is a flow chart showing a procedure applicable in the management of tables of the type shown in FIGS. 34A and 34B.

[0041] FIG. 36 is a general flow chart of a query processing procedure in accordance with an embodiment of the invention.

[0042] FIG. 37 is a diagram showing an example of query tree referring to the example of FIGS. 1-3.

[0043] FIG. 38 is another diagram showing an expanded query tree obtained by analyzing the query tree of FIG. 37.

[0044] FIG. 39 is a flow chart showing a procedure of analyzing the query tree.

[0045] FIG. 40, which is obtained by placing FIG. 40A above FIG. 40B, is a flow chart of a recursive function referred to in the procedure of FIG. 39.

[0046] FIG. 41 is a flow chart of a procedure for identifying matching data graphs based on an expanded query tree as illustrated in FIG. 38.

[0047] FIG. 42 is a flow chart of a recursive function FNODE called in the procedure of FIG. 41.

[0048] FIGS. 43-45 are flow charts illustrating procedures executed in steps 262, 264 and 265 of FIG. 42, respectively.

[0049] FIG. 46 is a flow chart showing an alternative embodiment of the procedure of step 265 of FIG. 42.

[0050] FIG. 47 is a flow chart showing another alternative embodiment of the procedure of step 265 of FIG. 42, when the thesauruses are stored as illustrated in FIGS. 34A and 34B.

[0051] FIG. 48 is a flow chart of a recursive function FILT called in the procedure of FIG. 47.

[0052] FIG. 49 is a flow chart showing another alternative embodiment of the procedure of step 265 of FIG. 42, when the thesauruses are stored as illustrated in FIG. 17.

[0053] FIG. 50 is a flow chart of a variant of a leaf processing used in the function of FIG. 42.

[0054] FIG. 51 is a flow chart showing a procedure applicable for scanning the thesaurus relating to a given attribute in order to retrieve the attribute values relevant to a database query.

[0055] FIG. 52 is a flow chart of a function FINTER referred to in the procedure of FIG. 51.

[0056] FIGS. 53-55 are flow charts showing procedures executed in steps 355, 357 and 358 of FIG. 52, respectively.

[0057] FIG. 56 is a flow chart showing an alternative procedure applicable in step 358 of FIG. 52, when the thesauruses are stored as illustrated in FIGS. 33-34.

[0058] FIG. 57 is a flow chart of a recursive function FFILT called in the procedure of FIG. 56.

[0059] FIGS. 58-61 show tables which may be stored to cooperate with the tables of FIGS. 25-34.

[0060] FIG. 62 is a flow chart showing a pre-filtering procedure which may be used prior to a thesaurus scanning similar to that of FIG. 51.

[0061] FIG. 63 is a flow chart showing a part of a thesaurus scanning procedure according to FIG. 51, adapted to take into account a pre-filtering according to FIG. 62.

[0062] FIG. 64 is a flow chart showing an alternative procedure applicable in step 358 of FIG. 52, when the thesauruses are stored as illustrated in FIG. 17.

[0063] FIG. 65 is a flow chart showing a procedure applicable in step 335 of FIG. 51.

[0064] FIGS. 66 and 67 show the contents of an exemplary output table used to provide a query response.

[0065] FIG. 68 is a diagram illustrating another possible structure of the output table.

[0066] FIGS. 69 and 70 are flow charts showing procedures applicable in step 335 of FIG. 51 to construct an output table of the type shown in FIG. 68.

[0067] FIG. 71 is a flow chart showing a procedure applicable in step 335 of FIG. 51 to perform computations in a database system by means of a computation table.

[0068] FIG. 72 is a block diagram of another computer system suitable for implementing the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

[0069] The invention is described herebelow in its non-limiting application to database management systems.

Virtual Data Graphs

[0070] FIGS. 1-3 illustrate a collection of data which can be stored in a computer memory coupled with a processor arranged for running relational database management programs. This example will be referred to in the following description to give an illustration of the principles and embodiments of the invention where appropriate.

[0071] FIGS. 1-3 show a conventional type of data organization in a database system. The illustrated system handles data relevant to a hypothetical insurance company which manages policies for its clients. The data are organized in three tables relating to the clients, policies and accidents as shown in FIGS. 1-3, respectively.

[0072] From a logical point of view, each data table consists of a two-dimensional matrix, with rows corresponding to respective records in the table and columns corresponding to respective data attributes of the records or structural features of the database (the latter type of column typically contains either local record identification keys or foreign keys designating records in a target table).

[0073] It will be appreciated, however, that for large databases the actual storage of the data in a memory medium, e.g. a magnetic disc, is frequently performed otherwise: each row typically has a memory address where the corresponding attribute values or keys are stored in the order of the columns and separated by predetermined symbols such as the encoded character “\”.

[0074] In our simplified example given to facilitate the explanation of the proposed data structures, the tables are of modest size. In practice, there are usually more tables and more attributes (columns) per table (notwithstanding, one or more tables could also have a single column). Moreover, the data tables generally include many more records, up to thousands or millions of rows depending on the application.

[0075] In that example, the database has a group of seven attributes distributed into three sub-groups corresponding to the three data tables. Each attribute has a column in the data table corresponding to its sub-group. The client data table (FIG. 1) has three attributes, i.e. client name, birth year and gender. The policy data table of FIG. 2 has two attributes, i.e. policy type (“car” or “house”) and policy effect date, and a link column to the client table. The accident data table of FIG. 3 has two attributes, i.e. date of accident and amount of damages incurred in a given currency, and a link column to the policy table.

[0076] In a given data table, each record/row has a unique identifier, referred to as a row-ID. This identifier corresponds to the memory address where the record is stored, usually through a conversion table. It may be stored as an identification key in a column of the data table for the purposes of unique row identification, but this is not compulsory. In our example, the row-ID's are integer indexes starting from zero for each data table, and they are not stored explicitly in a column of the table.

[0077] Some of the tables are linked together, as indicated in the last column of FIGS. 2 and 3. Two tables are directly linked if one of them (source table) has a link column provided for containing foreign keys designating records of the other one (target table).

[0078] Those foreign keys, hereafter called links, reflect the hierarchy and organization of the data handled in the relational database system. In our example, each accident dealt with by the insurance company is related to a certain policy managed by the company, hence the policy links of FIG. 3. Each policy is for a particular client of the company, hence the client links of FIG. 2. It will be noted that some links may be optional. For example, some accidents may involve third parties and if there is a separate table for third parties, then each record of the accident table may have a link to the third party table.

[0079] Each link typically consists of a row-ID in the target data table. For instance, the accident stored as row-ID=0 in the accident table of FIG. 3, which took place on Oct. 3, 1998 for an amount of 1,000, has a policy link pointing to the policy stored as row-ID=1 in the policy table of FIG. 2, i.e. it relates to a car policy subscribed on Sep. 9, 1998 by the client with row-ID=1 in the client table of FIG. 1, i.e. André, a man born in 1976. If the target table has other forms of record identification keys, for example compound keys, a link may also designate a target record as identified by such a key.

[0080] The construction of the links obeys a number of rules. In particular, the linked data tables have a directed acyclic graph structure such as a hierarchical tree organization illustrated in FIG. 4. A root table is defined as a data table for which no other data table has links pointing to its rows, such as the accident table of FIG. 3. In other words, a root table does not constitute a target table. Likewise, a leaf table is defined as a data table with no link column, such as the client table of FIG. 1. In other words, a leaf table does not constitute a source table. FIG. 4 shows only one root table, but the tree structure of the tables may have multiple roots.

[0081] It may happen in certain cases that a group of related data tables exhibit circular links (for example, the client table may have a link column to the accident data table to indicate the first, or last, accident undergone by each client). In such a case, the tree organization of the data tables is first restored by canceling one link of the circle. Which link should be cancelled is dictated by the semantics of the database (in the above example, the link from the client table to the accident table will naturally be cancelled).

[0082] Paths are defined in the data table tree from the root table(s) to the leaf tables. Each path from a root table to a leaf table is defined by a link column of the root table pointing to the leaf table, or by a succession of link columns via one or several intermediate tables.

[0083] In FIG. 4, two leaf tables have been added (dashed lines) to show a tree structure with multiple branching (the simplified example of FIGS. 1-3 provides a tree with a single path shown with a solid line). The added leaf tables are a third party table as mentioned previously and a broker table which is a target table from the policy table, to contain data about the brokers who commercialize the policies.

[0084] The data table records that are linked together can be viewed in a similar tree representation (FIGS. 5-7). The record tree of FIG. 5 shows that the accident #6 was related to policy #0 (car) subscribed by client #2 (Ariane) through broker #Y and involved third party #X. The solid lines represent respective links from the data tables of FIGS. 2 and 3.

[0085] The record tree of FIG. 6 further shows a Null record which may be added in the accident table with a link to row-ID=2 in the policy table, for the reason that, as apparent from the last column of FIG. 3, no accident has occurred under policy #2 (subscribed by client #4 (Max) for his house).

[0086] A Null, or dummy, record stands for the absence of data. All its attribute values are default values (Null), which means “no value”. The purpose of inserting such dummy records in the present scheme is to make sure that any valid record in any data table belongs to at least one record tree stemming from a record of a root table (FIG. 4).

[0087] A Null record may also be present in each data table which is a target table for at least one link column of a source table. When a row of the source table has no foreign key in the corresponding link column, the record tree(s) including that row is (are) completed with a Null at the location of said target table. This situation occurs for the broker table in the example illustrated in FIG. 6. To represent this, a default value (e.g. −1) can be written in the link column of the source table, whereby the Null record is implicitly present in the target table.

[0088] The Null records are inserted where appropriate in a process of scanning every single path in the data table tree from the leaf table of said path to the root table, i.e. downwardly in FIG. 4. When examining one source/target table pair in the scanning of a path, the target table row-ID values that do not occur in the relevant link column of the source table are first listed, and then for each missing row-ID value of the list, a new Null record is generated in the source table with said missing row-ID value in said link column.

[0089] If a Null record is thus inserted in a data table having several link columns, the Null record receives the default value (−1) in any link column other than the one pertaining to the path being scanned, to indicate that the corresponding link is to a Null record in the target table. This situation occurs for the third party table in the example illustrated in FIG. 6.

[0090] Scanning the data table tree from the leaves to the root is important. Otherwise, Null records containing links to other Null records in a target table might be overlooked. An example is shown in FIG. 7 which shows a record tree relating to client #0 (Oscar) who has no (more) policy: the accident table contains a Null record pointing to another Null record of the policy table which, in turn, points to client #0; the root of the record tree would not be in the root (accident) table if the paths were scanned upwardly.
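The effect of the scanning order can be sketched on a miniature three-table chain. The link columns below are hypothetical and much smaller than those of FIGS. 2 and 3; a table is reduced to its link column, and a Null record is represented only by the link value it carries. The function name is an assumption of the example.

```python
def add_null_records(link_column, target_row_count):
    """Append one Null record per target row-ID missing from the link column."""
    referenced = set(link_column)
    missing = [row_id for row_id in range(target_row_count)
               if row_id not in referenced]
    return link_column + missing

client_row_count = 3       # clients #0, #1 and #2 (hypothetical)
policy_links = [1, 2]      # client link of policies #0 and #1
accident_links = [1, 1]    # policy link of accidents #0 and #1

# Leaf to root: complete the policy table first, then the accident table.
policy_done = add_null_records(policy_links, client_row_count)
accident_done = add_null_records(accident_links, len(policy_done))

# Wrong order: the accident table is completed before the Null policy exists.
accident_wrong = add_null_records(accident_links, len(policy_links))
```

In the correct order, the Null policy #2 (pointing to client #0) receives its own Null accident. In the wrong order, no accident record points to policy #2, so client #0 is not reachable from the root table.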

[0091] In a conventional database organization as shown in FIGS. 1-3, the link keys are provided to optimize the memory usage. To illustrate this, reference may be made to the flat file shown in FIG. 8, which has exactly the same informational content as the three data tables of FIGS. 1-3 (the third party and broker tables are ignored in the sequel).

[0092] A flat file has a column for each one of the attributes (columns) of the data tables. For each complete record tree that can be constructed with the data table tree structure of FIG. 4, the flat file has a row which contains, in the relevant columns, the attribute values of all the records of said tree. The rows of the flat file are referred to herein as data graphs. Each data graph is identified by a flat file row-ID shown in the left-hand portion of FIG. 8. The record trees of FIGS. 5-7 are compact representations of the data graphs at row-ID's 6, 9 and 11, respectively.

[0093] Although the flat file representation is sometimes referred to in the literature, it is of little practical interest for databases of significant size. The reason is that it requires excessive redundancy in the data storage.

[0094] For example, in our small-sized case, André's birth year and gender, as well as the details of his car policy are written three times in the flat file (row-ID's 0, 3 and 8), whereas they are written only once, along with link values, when the storage is in the form of data tables as in FIGS. 1-3. With databases of realistic size, such redundancy is not acceptable.

[0095] The database system according to the invention makes use of the flat file concept. However, it does not require the storage of the flat file as shown in FIG. 8, hence the concept of “virtual flat file” containing “virtual data graphs” (VDG). The term “virtual” refers to the fact that the flat file or data graphs need not be maintained explicitly in memory, although their data structure is used as a reference in the execution of the method.

[0096] In a particular embodiment of the invention, the flat file is reduced to a link table as shown in FIG. 9. Each row of the link table corresponds to a respective row of the flat file, i.e. to a record tree as shown in FIGS. 5-7.

[0097] The columns of the link table respectively correspond to the data tables of FIGS. 1-3. In other words, each column of the link table is associated with an attribute sub-group which is the sub-group of attributes allocated to the corresponding (target) data table. Each column of the link table contains link values (row-ID's) designating records of the corresponding target data table.

[0098] The row of the link table corresponding to a given data graph contains a default value (−1) in the column corresponding to any data table having a Null record in the record tree representing said data graph.

[0099] The data table row-ID's found in one row of the link table enable the retrieval of linked data from the data tables, i.e. a data graph or part of it. All the links are represented in the link table. If one replaces the row-ID's stored in the columns of the link table of FIG. 9 by the attribute values stored in the identified rows of the respective data tables of FIGS. 1-3, one recovers the flat file of FIG. 8.
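The substitution described above may be sketched as follows. The three columns and the link table are hypothetical miniatures, not the contents of FIGS. 1-3 and 9; each data table is reduced to a single attribute, and the default value −1 is rendered as `None` in the recovered row.

```python
NULL_LINK = -1   # default value marking a Null record in the link table

# One attribute column per data table (hypothetical contents).
client_names = ["Oscar", "André"]
policy_types = ["car", "house"]
accident_amounts = [1000, 250, 4000]

# Link table rows: (accident row-ID, policy row-ID, client row-ID).
link_table = [(0, 0, 1), (1, 1, 1), (2, 0, 1), (-1, -1, 0)]

def value(column, row_id):
    """Attribute value at row_id, or None for a link to a Null record."""
    return None if row_id == NULL_LINK else column[row_id]

# Replacing each row-ID by the attribute value it designates yields
# one flat file row (data graph) per link table row.
flat_file = [
    (value(client_names, c), value(policy_types, p), value(accident_amounts, a))
    for a, p, c in link_table
]
```

The last link table row illustrates a client without policy or accident: the two −1 links expand to empty attribute values.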

[0100] The proposed system further uses word thesauruses (FIGS. 10A-G) each associated with a respective column of one of the data tables, i.e. with one of the attributes.

[0101] In a preferred embodiment, there is one word thesaurus for each attribute used in the database system. However, if some attributes are known to be never or almost never used in the query criteria, then it is possible to dispense with the thesaurus for such attribute.

[0102] Each word thesaurus associated with one column of a data table has an entry for each attribute value found in that column. Such attribute value is referred to herein as a “word”. A word has one entry in a thesaurus, and only one, as soon as it occurs at least once in the associated data table column. The Null value is a valid word in the thesaurus.

[0103] The entries of each thesaurus are sorted on the basis of the attribute values. An order relationship is therefore defined for each attribute category. This requires attention when the attribute value fields of the thesaurus files are defined and dimensioned.

[0104] Typically, the words are in the ASCII format and their category is selected for each column among the categories “integer”, “real” and “character string”. Character strings are sorted according to the usual lexicographical order. A date field is preferably declared as a character string such as yyyy (mm) (dd) (FIGS. 10B, 10E and 10F), yyyy representing the year, mm the month (optionally) and dd the day in the month (optionally). The thesaurus sorting thus puts any dates in the chronological order. If the attribute category is “integer”, the numbers are aligned on the right-hand digit, in order to provide the natural order relationship among the integer data values. If the attribute category is “real”, the numbers are aligned according to their whole parts, with as many digits on the right as in the value having the longest decimal part in the column.
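The two sorting conventions can be checked with a short sketch. The dates are those quoted in Example 1 of this description; the integer values and the use of space padding for the right alignment are assumptions of the illustration.

```python
# Dates written as "yyyy mm dd" strings sort chronologically
# under plain lexicographical comparison.
dates = ["1999 12 09", "1998 10 03", "1999 06 12", "1999 12 08"]
chronological = sorted(dates)

# Integers stored as strings must be right-aligned to sort numerically.
amounts = [1000, 900, 12500]                 # illustrative values
width = max(len(str(a)) for a in amounts)
aligned = sorted(str(a).rjust(width) for a in amounts)

# Without alignment the lexicographical order is not the numeric order.
unaligned = sorted(str(a) for a in amounts)
```

The space character sorts before the digits in ASCII, so the padded strings come out in numeric order, whereas the unaligned strings place "900" last.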

[0105] The Null value is at one end (e.g. at the beginning) of each sorted thesaurus.

[0106] Each entry E(W) for a word W in a thesaurus associated with a column C(T) of a data table T contains information for identifying every row of the flat file which has the attribute value W in the column corresponding to C(T). When the flat file is stored virtually in the form of a link table, the information contained in entry E(W) is used for identifying every row of the link table which, in the column corresponding to the data table T, has a link pointing to a row having the value W in column C(T).

[0107] In other words, with the contents of the entry E(W) in the thesaurus associated with column C(T), we can retrieve all the data graphs whose corresponding attribute has the value W.

[0108] Such contents represent a row-ID list pointing to rows of the (virtual) flat file, i.e. a data graph identifier list. Such list may be empty, in particular for the Null value in some of the thesauruses (as in FIGS. 10A-C).

[0109] Two alternative representations of the data graph identifier lists in the thesauruses are illustrated in FIGS. 10A-G for the seven attribute columns of FIGS. 1-3. The first one is the form of explicit integer lists.

[0110] The second (equivalent) representation is in the form of bitmap vectors whose length is equal to (or greater than) the number of rows in the virtual flat file, i.e. the number of data graphs. The bit of position i in a bitmap vector (i≧0) indicates whether the integer i belongs (1) or not (0) to the row-ID list represented by the bitmap vector. In our simplified example, the flat file has 12 rows so that the bitmap vectors may be of dimension 12.
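A thesaurus with both representations can be sketched as below. The flat file column is a hypothetical five-row example (the thesauruses of FIGS. 10A-G are not reproduced here), the Null word is left out for brevity, and the bitmap is written as a character string with position 0 on the left, as in the vectors quoted in this description.

```python
def build_thesaurus(column):
    """Map each word of a flat file column to its flat file row-ID list,
    with the entries sorted by word."""
    entries = {}
    for row_id, word in enumerate(column):
        entries.setdefault(word, []).append(row_id)
    return dict(sorted(entries.items()))

def as_bitmap(row_ids, dimension):
    """Bitmap string whose character i is '1' when i is in row_ids."""
    members = set(row_ids)
    return "".join("1" if i in members else "0" for i in range(dimension))

policy_type_column = ["car", "house", "car", "car", "house"]   # hypothetical
thesaurus = build_thesaurus(policy_type_column)
bitmaps = {word: as_bitmap(rows, len(policy_type_column))
           for word, rows in thesaurus.items()}
```

Each word appears once, and its entry identifies every data graph in which the attribute takes that value.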

[0111] The above-described data structure, comprising a virtual flat file and sorted thesaurus files pointing to rows of the virtual flat file, is referred to herein as a VDG structure.

[0112] The VDG structure provides a powerful tool for efficiently processing queries in the database.

[0113] The virtual flat file is a reference table which defines a unified algebraic framework for the entries of all the thesauruses. The query criteria are examined with reference to the relevant thesauruses to obtain a flat file row-ID list (or bitmap vector) which represents all data graphs matching the query criteria, if any. The results can then be delivered by accessing the link table rows pointed to in that row-ID list to read the links which appear in part or all of the columns in order to retrieve attribute values as desired for the result presentation.

[0114] The processing with reference to the thesauruses mainly consists in logical operations performed on the row-ID lists to which they point. If they are represented as integer lists, such operations can be reduced to basic merge, intersect and/or complement operations, which respectively correspond to Boolean OR, AND, NOT operations in the bitmap representation.

[0115] The VDG structure also provides an efficient tool for accessing the contents of the database, which does not require accesses to the data tables. This tool is well suited to queries having special result presentation features such as SORT, COUNT, DISTINCT, ORDER BY, GROUP BY, etc. clauses, and also for carrying out any type of calculation on the data values of the records which match the query.

EXAMPLE 1

[0116] As an illustration, consider the following query: find the client name and accident date for all car accidents that incurred damages higher than 900, and group the results according to the client name. The query may be processed as follows. First, all the flat file row-ID lists identified in the accident amount thesaurus entries relating to amounts higher than 900 (the five last rows of FIG. 10G) are merged, which yields the list {0, 1, 3, 5, 6, 7} (or the bitmap vector 110101110000 obtained by a bitwise Boolean OR). Then the intersection of that list with the row-ID list identified in the policy type thesaurus entry relating to the value “car” (the second row of FIG. 10D) is determined. The result list {0, 3, 5, 6} (or bitmap vector 100101100000 obtained by a bitwise Boolean AND) specifies the data graphs that satisfy the query criteria. Finally, the entries of the client name thesaurus (FIG. 10A) are read sequentially and when there is a non-empty intersection between the result list and the row-ID list identified in the client name thesaurus entry, the link table rows having their row-ID's in that intersection are read to retrieve the desired attribute values. In our case, the output would be: André [accident dates 1998 10 03 (#0) and 1999 06 12 (#3)], Ariane [accident date 1999 12 09 (#6)] and Laure [accident date 1999 12 08 (#5)].
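The intersection and grouping steps of this example can be replayed in a few lines. The merged list and the two bitmap vectors are those quoted above. The “car” row-ID list and the client name entries are not reproduced in this text, so the values used below are assumptions, chosen only to be consistent with the results stated in Examples 1 and 2 and with paragraph [0094]; the entries for Ariane and Laure in particular may be incomplete.

```python
DIMENSION = 12   # number of rows of the virtual flat file in the example

def rows_of(bitmap):
    """Row-IDs whose character is '1', position 0 being on the left."""
    return {i for i, bit in enumerate(bitmap) if bit == "1"}

def bitmap_of(rows):
    """Bitmap string of dimension DIMENSION for a set of row-IDs."""
    return "".join("1" if i in rows else "0" for i in range(DIMENSION))

over_900 = rows_of("110101110000")   # merged list given in the text
car = {0, 3, 5, 6, 8}                # ASSUMED "car" entry (not in the text)
result = over_900 & car              # Boolean AND of the two vectors

# ASSUMED client name entries; only André's rows are stated in the text.
client_names = {"André": {0, 3, 8}, "Ariane": {6}, "Laure": {5}}
grouped = {name: sorted(rows & result)
           for name, rows in client_names.items() if rows & result}
```

Reading the client name entries in sorted order and keeping the non-empty intersections gives the grouping by client name without any access to the data tables.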

[0117] The above type of processing is applicable to any kind of query.The response is prepared by referring only to the sorted thesauruses,which implicitly refer to the flat file framework. Once an output flatfile row-ID list is obtained, the link table or the thesauruses can beused for retrieving the data of interest.

EXAMPLE 2

[0118] To further illustrate the outstanding performance of the VDGscheme, let us consider the query which consists in identifying anyclient who has had a car accident before the beginning of the civil yearof his or her 35^(th) birthday. In a typical conventional system, allthe records of the accident data table of FIG. 3 have to be read toobtain the date attribute and policy link values. For each accidentrecord, the policy data table is read at the row-ID found in the policylink column to obtain the policy type attribute and client link valuesand then, if the policy type is “car”, another access to the client datatable is necessary at the row-ID found in the client link column toobtain the birth year attribute value. The latter value is compared withthe date attribute value previously obtained in the accident table todetermine whether the criteria of the query are fulfilled.

[0119] If the data tables are sorted beforehand on the basis of the attributes referred to in the query criteria, such conventional processing may be accelerated by limiting the number of disc accesses. This requires data table sorting every time records are added, deleted or amended, which is not practical when the collection of data is large. And it is possible only in specific cases dictated by the data table sorting rule.

[0120] For example, if the client and policy tables were respectively sorted on the basis of the birth year and policy type attributes, the preceding request could be processed in a less prohibitive time by accessing the data records in a suitable order and with the help of the computer cache memory. However, this trick would not apply to other similar queries (e.g., assuming an additional column in the policy table for containing excess amounts, the identification of all accidents for which the damage amount was more than ten times the excess amount would raise the same problems).

[0121] With the VDG scheme, the above illustrative query can be dealt with in a very efficient manner. By means of the client birth year thesaurus (FIG. 10B) and the accident date thesaurus (FIG. 10G), the computer identifies the {client birth year, accident date} word pairs which satisfy the date criterion, i.e. accident date earlier than beginning of client's birth year+35. This is done without worrying about whether the accident was undergone by the client. Such identification is relatively easy for any possible pair of attributes since any attribute likely to be referred to in queries has a sorted thesaurus. For each identified word pair, the intersection of the two flat file row-ID lists of the thesaurus entries is obtained. The resulting integer lists are merged. Then the computer intersects the row-ID list of the entry relating to the value “car” in the policy type thesaurus (second row in FIG. 10D) with the list {0, 1, 3, 5, 6, 8, 10} resulting from the merger. The resulting list {0, 3, 5, 6, 8} designates a set of matching rows in the link table, from which the relevant client names (André, 3 times, Laure and Ariane) are readily retrieved by accessing the client table records whose row-ID's appear in the matching rows and in the client column of the link table.
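The pairwise processing described in this paragraph may be sketched as follows in Python, as a non-authoritative illustration. All thesaurus contents below are hypothetical (they are not the data of FIGS. 10B, 10D and 10G), dates are reduced to integer years for brevity, and the function name `match_rows` is an assumption of the sketch.

```python
from itertools import product

# Hypothetical thesauruses: word -> set of flat file row-IDs.
birth_year = {1960: {0, 3, 8}, 1970: {5, 6}, 1980: {2, 7}}
accident_year = {1990: {0, 5}, 1998: {3, 7}, 2004: {6, 8}}
car = {0, 3, 5, 6, 8}

def match_rows(birth, accident, age=35):
    """Merge the intersections of all word pairs meeting the date criterion."""
    merged = set()
    for (b_year, b_rows), (a_year, a_rows) in product(birth.items(),
                                                      accident.items()):
        # Criterion: accident before the beginning of year (birth year + age).
        if a_year < b_year + age:
            merged |= b_rows & a_rows  # intersect the pair, then merge
    return merged

merged = match_rows(birth_year, accident_year)
matching = merged & car  # final intersection with the "car" row-ID list
print(sorted(merged), sorted(matching))
```

Note that the word pairs are selected from the sorted thesauruses alone; the row-ID lists are only touched for the pairs that satisfy the criterion.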

[0122] It is noted that, when processing a query, the link table is simply used as a means to retrieve the data of interest. Different ways of achieving this retrieval function may be thought of.

[0123] One method is to keep the original data tables (FIGS. 1-3) in memory. However, it is worth noting that the link columns may be deleted from those data tables, since their contents are already present in the link table.

[0124] From the observation that all possible attribute values are stored in the corresponding thesauruses, another method is to store in the link table pointers to the thesauruses. The latter method reduces the required disc space since an attribute value has to be written only once, even if the value occurs frequently in a data table column. It enables quick retrieval of the attribute values which occur in a given flat file row without requiring the use of the original data tables.

[0125] For certain attributes, it may be interesting to store the explicit attribute values in the link table, i.e. like in the flat file. In particular, this may be interesting for numerical fields (usually of smaller size than character strings) whose values are very dispersed and which are often requested among the output attributes of a query response (e.g. money amounts). If those values are explicitly written in the link table, there can be an appreciable gain in the disc accesses required for fetching the output data, at the cost of a moderate increase in the needed disc space.

[0126] In the foregoing explanations, the link table is a sort of skeleton of the flat file, which is stored to facilitate the data retrieval once the list of flat file row-ID's matching the query has been determined by means of the sorted thesauruses.

[0127] Notwithstanding, storing a link table or any form of table reflecting the flat file structure is not strictly necessary. In an advantageous embodiment, the data graphs (or their portions requested for the result presentation) may be recovered from the thesaurus files only. To illustrate this, consider again Example 2. Once the result list {0, 3, 5, 6, 8} of matching virtual flat file rows has been obtained by processing the query criteria with reference to the thesaurus files, it is possible to scan the client name thesaurus and, for each word (client name), to intersect the flat file row-ID list represented in the thesaurus with the result list. If the intersection is non-empty, the word is included in the output. It may be accompanied with the intersection list to allow the user to quickly obtain further information from the relevant data graphs. This method requires the minimum memory space since only the thesaurus files need to be stored.
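The thesaurus scan described in this paragraph may be sketched as follows in Python. The row-ID lists given for each client name are hypothetical (they are merely consistent with the results quoted for Example 2, not copied from FIG. 10A), and `scan_thesaurus` is a name chosen for the sketch.

```python
# Hypothetical client name thesaurus: word -> set of flat file row-IDs.
client_name = {
    "André": {0, 3, 8},
    "Ariane": {6},
    "Laure": {5},
    "Max": {2, 7, 9},
}

result_list = {0, 3, 5, 6, 8}  # result list of Example 2

def scan_thesaurus(thesaurus, result):
    """Return each word whose row-ID list intersects the result list,
    together with the (sorted) intersection list."""
    output = {}
    for word in sorted(thesaurus):       # thesaurus words are kept sorted
        hits = thesaurus[word] & result
        if hits:                         # keep only non-empty intersections
            output[word] = sorted(hits)
    return output

output = scan_thesaurus(client_name, result_list)
print(output)
```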

[0128] Even if a link table is stored, it may be advantageous, for certain queries, to retrieve the attribute values by scanning the thesaurus(es) as indicated hereabove rather than through the link table. This may occur, in particular, to perform computations on the data values when there is a relatively slow interface between the query processor and the data tables, e.g. an ODBC interface (“Open DataBase Connectivity”).

[0129] Another advantage of the VDG scheme is that it provides a query processing engine which can co-exist with the data tables in their original form. Changes in the thesaurus entries are then done in response to corresponding changes in the original data tables. This is an interesting feature for users who find it important to keep their data in the form of conventional tables, because they do not want to be too dependent on a new system or because they need to access their tables through a conventional interface for other applications.

Macrowords

[0130] The above-described VDG's are advantageously completed with prefix thesauruses also referred to as macroword thesauruses.

[0131] Like the above-described word thesauruses, each macroword thesaurus is associated with one attribute, i.e. one column of one data table. In addition, it has a prefix length (or truncation length) parameter.

[0132] Each entry of the macroword thesaurus relates to a range of attribute values, and contains or points to data for identifying all the flat file rows having, in the column corresponding to said attribute, an attribute value which falls within said range. The range corresponding to the entry of the macroword thesaurus corresponds to a prefix value having the prefix length assigned to the thesaurus: any word beginning with such prefix value has its flat file row-ID list included in that of the macroword. If the prefix length is noted P, a macroword C₁C₂ . . . C_(P) is the set of all values of the attribute which begin with the P characters or digits C₁C₂ . . . C_(P). The limit case where the prefix length is the number of characters or digits of the value field (i.e. truncation length is zero) is the word thesaurus described previously.

[0133] In other words, the macroword thesaurus entry identifies the flat file row-ID list (or bitmap vector) corresponding to the merger of the flat file row-ID lists (or to the logical OR between the bitmap vectors) which are identified in the entries of the word thesaurus corresponding to the individual words encompassed by the macroword.

[0134] Each thesaurus (word or macroword) associated with an attribute AT can thus be defined with reference to a partition into subsets of the set of words which can be assigned to attribute AT in the relevant data table record. It has a respective entry for each subset including at least one word assigned to attribute AT, this entry being associated with a flat file row-ID list including any ID of a flat file row having a word of the subset assigned to attribute AT. In the case of a macroword thesaurus, the partition is such that each subset consists of words beginning with a common prefix. In the case of a word thesaurus, the partition is such that each subset consists of only one word.

[0135] As an example, FIG. 10H shows the accident amount macroword thesaurus for a truncation length of 3 characters. It is not necessary to repeat the Null entry, which is already in the word thesaurus. Such a macroword thesaurus provides substantial economy in terms of disc accesses and flat file row-ID list mergers. For example, for obtaining information about the accidents that had an amount between 1,000 and 1,999, one access to the macroword thesaurus of FIG. 10H is enough to obtain the relevant list of flat file row-ID's {0, 2, 3}, whereas it would require two thesaurus accesses and one merge operation with the non-truncated accident amount thesaurus of FIG. 10G. The gain can be quite substantial for large databases and attributes of high cardinality, i.e. with many possible attribute values.

[0136] Macroword thesauruses based on prefix or truncation lengths provide a great flexibility in the processing of range-based query criteria. It is possible, for a given attribute, to provide several macroword thesauruses having different prefix lengths in order to optimize the processing speed of various queries.

[0137] Typically, a date attribute may have a yearly macroword thesaurus (prefix length=4) and a monthly thesaurus (prefix length=6) in addition to the (daily) word thesaurus. Any other kind of attribute (numbers or text) may lend itself to a convenient macroword thesaurus hierarchy.
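As a minimal sketch of how such a hierarchy may be derived, the following Python fragment builds yearly and monthly macroword thesauruses from a daily word thesaurus by prefix truncation and merger of the row-ID lists. The function name is an assumption, dates are assumed to be stored as 8-character YYYYMMDD words, and the row-ID lists are illustrative only (they reuse the four accident dates quoted in the first example, not the full contents of the figures).

```python
from collections import defaultdict

def build_macroword_thesaurus(word_thesaurus, prefix_length):
    """Merge the row-ID lists of all words sharing the same prefix."""
    macro = defaultdict(set)
    for word, row_ids in word_thesaurus.items():
        macro[word[:prefix_length]] |= row_ids  # merger = Boolean OR
    return dict(sorted(macro.items()))          # kept in sorted form

# Illustrative daily word thesaurus: word -> flat file row-IDs.
daily = {
    "19981003": {0},
    "19990612": {3},
    "19991208": {5},
    "19991209": {6},
}

yearly = build_macroword_thesaurus(daily, 4)
monthly = build_macroword_thesaurus(daily, 6)
print(yearly)
print(monthly)
```

With such thesauruses, a criterion such as “accidents in 1999” is answered by one access to the yearly entry instead of three accesses and two mergers in the daily thesaurus.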

VDG Compression

[0138] With the VDG scheme as described so far, the memory space required by the thesaurus files is not optimized.

[0139] The row-ID's being integers typically coded with 32 bits, if a word occurs N times in the attribute column of the flat file of FIG. 8, N×32 bits are needed to explicitly encode its flat file row-ID list. If the flat file has N_(max) rows (for example millions of rows), N_(max) bits are needed for each entry in the bitmap representation, for whatever value of N.

[0140] Generally speaking, for an attribute of high cardinality, such as the date or amount attributes (FIGS. 10E-G), the flat file row-ID lists are scarcely filled, so that the explicit integer list representation is satisfactory in terms of memory requirement, while the bitmap representation can be prohibitive for large flat files. Other attributes have a low cardinality, such as the client gender or policy type attribute in our example (FIGS. 10C-D), whereby the bitmap representation is well suited, while the integer list representation is unfavorable.

[0141] It is possible to adopt for each thesaurus a representation which is believed to be the most appropriate in order to reduce the needed memory space. However, this requires an a priori knowledge of how the attribute values will be distributed. Many attributes can be ambiguous in this respect, and the optimization may also be difficult for different macroword sizes relating to a given attribute.

[0142] Bitmap compression methods as known in the art (e.g. U.S. Pat. No. 5,363,098 or No. 5,907,297) may also be used. A problem is that those methods are designed essentially for optimizing the storage volume, not the processing speed. In the VDG context, the advantage of reduced memory space may be counterbalanced by the disadvantage of longer response times due to multiple compression and/or decompression operations when processing a query. To the contrary, it is desired to increase the processing speed as much as possible.

[0143] In the preferred implementation of the VDG scheme, the compression of the flat file row-ID lists in the thesauruses is carried out by dividing a range covering all the row-IDs of the flat file into subsets according to a predetermined pattern. Then, each flat file row-ID list of a thesaurus entry is encoded with data for locating in the pattern each subset of the range which contains at least one row-ID of the list, and data representing the position of each integer of the row-ID list within any subset thus located.

[0144] The row-ID range [0, N_(max)[ is selected such that N_(max) is equal to or larger than the number of rows in the flat file. The “predetermined pattern” conveniently defines the “subsets” as consecutive intervals [0, D1[, [D1, 2×D1[, etc., having the same length D1 within said range.

[0145] The coding data can then be produced very simply by Euclidean division. For any positive numbers x and y, we note └x┘ the integer equal to or immediately below x, ┐x┌ the integer equal to or immediately above x, and x mod y=x−y×└x/y┘. A Euclidean division by D1 is performed for each row-ID N of the input list. The quotient Q1=└N/D1┘ indicates the rank of the corresponding interval in the pattern (Q1≧0), while the remainder R1=N mod D1 represents the position of the row-ID within the interval (0≦R1<D1). The decoding is also very simple: from the encoding data Q1 and R1 for an item of the coded list, the row-ID is N=Q1×D1+R1.
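The encoding and decoding by Euclidean division may be sketched as follows in Python. The function names are assumptions of the sketch. The example list {2, 7, 9} is the row-ID list inferred for the word “Max” from the layer 1 ranks and segments quoted further on in connection with FIG. 13A, assuming that the leftmost bit of a segment is position 0.

```python
def encode(row_ids, d1):
    """Return the (Q1, R1) pair of each row-ID N, with N = Q1*d1 + R1."""
    return [divmod(n, d1) for n in sorted(row_ids)]

def decode(pairs, d1):
    """Rebuild the row-IDs from their (Q1, R1) pairs."""
    return [q1 * d1 + r1 for q1, r1 in pairs]

D1 = 3                      # non-typical value, as in FIGS. 11A-11H
coded = encode({2, 7, 9}, D1)
print(coded)                # (rank, position) pairs
print(decode(coded, D1))    # original row-IDs
```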

[0146] Advantageously, the interval length is a whole power of 2, so that the Euclidean divisions are performed by straightforward bit shift operations. A typical length is D1=2⁷=128.
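For an interval length that is a power of 2, the quotient and remainder reduce to a shift and a mask, as in the following Python sketch (the constant and function names are assumptions):

```python
SHIFT = 7              # D1 = 2**7 = 128
D1 = 1 << SHIFT
MASK = D1 - 1          # the low 7 bits hold the remainder

def split(n):
    """Return (Q1, R1) for row-ID n using only bit operations."""
    return n >> SHIFT, n & MASK

def join(q1, r1):
    """Inverse operation: N = Q1*D1 + R1."""
    return (q1 << SHIFT) | r1

print(split(1000), join(*split(1000)))
```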

[0147] The encoding method can be expressed equivalently by referring to the bitmap representation. Each bitmap vector is divided into bitmap segments (or other types of bit groups if a more tortuous pattern is referred to), and for each segment containing at least one “1”, the coding data include the rank (=Q1) and the contents of the segment. The all zero segments are discarded.

[0148] FIGS. 11A, 11G and 11H are other presentations of the client name and accident amount word thesauruses of FIGS. 10A and 10G and of the accident amount macroword thesaurus of FIG. 10H, with D1=3 (a non-typical value of D1 used here for conciseness). The second columns are copied from the last columns of FIGS. 10A, 10G and 10H, respectively, with blanks to highlight the segmentation of the bitmap vectors. The third columns show the lists of ranks (=Euclidean quotients Q1) resulting from the encoding, and the fourth columns show the corresponding non-zero bitmap segments (having a 1 at the position of each remainder R1).

[0149] It is observed that for each thesaurus entry, the ranks Q1 form an integer list included in the range [0, N1_(max)[, with N1_(max)=┐N_(max)/D1┌.

[0150] According to a preferred embodiment of the invention, a similar type of encoding can be applied to those rank lists. The encoding process may be iterated several times, with the same encoding pattern or different ones. In particular, the interval lengths could vary from one iteration to the next one. They are preferably whole powers of 2.

[0151] The ranks and bitmap segments obtained in the first iteration with the interval length D1 are called layer 1 (or L1) ranks and layer 1 segments (FIGS. 11A, 11G and 11H). Those obtained in the second iteration, with an interval length noted D2, are called layer 2 (or L2) ranks and layer 2 segments (FIGS. 12A, 12G and 12H), and so forth.

[0152] In the following, n denotes the number of encoding layers numbered k with 1≦k≦n, layer k having a divisor parameter Dk, and the product $\Delta k = \prod_{k'=1}^{k-1} Dk'$

[0153] being the number of flat file row-ID's encompassed by one bit of a layer k bitmap segment (Δ1=1).
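The quantity Δk may be computed as in the following Python sketch, where the list `divisors` is assumed to hold D1, D2, etc. in layer order (the function name is likewise an assumption):

```python
from math import prod

def delta(k, divisors):
    """Number of row-IDs covered by one bit of a layer k segment:
    the product of D1 ... D(k-1), which is 1 for k = 1 (empty product)."""
    return prod(divisors[:k - 1])

divisors = [128, 128]       # D1 = D2 = 128
print(delta(1, divisors), delta(2, divisors), delta(3, divisors))
```

With D1=D2=128, one bit of a layer 2 segment thus covers 128 row-IDs, and one layer 2 segment as a whole covers 128×128=16,384 row-IDs.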

[0154] In the simplified case illustrated in FIGS. 12A, 12G and 12H, n=2 and the second encoding layer uses D2=2. The columns labeled “L1 Bitmap” are a bitmap representation of the layer 1 rank lists, with blanks to highlight the further bitmap segmentation leading to the layer 2 data shown in the last two columns.
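A minimal Python sketch of the iterated encoding with n=2, D1=3 and D2=2 is given below. Function names are assumptions, segments are shown as strings with position 0 on the left, and the input list {2, 7, 9} is the row-ID list inferred for “Max” from the layer 1 data quoted further on; the layer 2 output follows from applying the method to it and is not copied from FIG. 12A.

```python
def encode_layer(integers, d):
    """Return {rank: segment} for the non-zero segments of length d."""
    segments = {}
    for n in sorted(integers):
        q, r = divmod(n, d)
        bits = segments.setdefault(q, ["0"] * d)
        bits[r] = "1"                      # mark position r in segment q
    return {q: "".join(bits) for q, bits in segments.items()}

def encode_layers(row_ids, divisors):
    """Iterate the encoding: the ranks of layer k feed layer k+1."""
    layers = []
    current = row_ids
    for d in divisors:
        coded = encode_layer(current, d)
        layers.append(coded)
        current = coded.keys()             # rank list of this layer
    return layers

layer1, layer2 = encode_layers({2, 7, 9}, [3, 2])
print(layer1)
print(layer2)
```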

[0155] The layer 1 and layer 2 coding data are summarized in FIGS. 13A, 13G and 13H which show a possible way of storing the flat file row-ID list information. It is noted that storage of the layer 1 rank lists is not strictly necessary since those lists are completely defined by the layer 2 data. However, it will be appreciated further on that such storage somewhat simplifies the query processing in certain embodiments of the invention.

[0156] The same kind of encoding may be used for any one of the word and macroword thesauruses. However, it is also possible for some of them to retain a conventional type of row-ID list storage (explicit integer lists or bitmap vector), i.e. n=0. In particular, the explicit integer list representation may remain well-suited for scarcely distributed thesauruses.

[0157] FIGS. 14-16 show another possible way of storing the information contained in the thesauruses of FIGS. 13A, 13G and 13H. For each encoding layer, the thesaurus entries are associated with respective chains of records in a data container (FIG. 15 for layer 1 and FIG. 16 for layer 2) including a rank file and a bitmap segment file. Each record in the layer k rank file (1≦k≦n) has a field for receiving a rank value (between 0 and Nk_(max)−1) and a field for receiving an address of a next record in the rank file. A default value in the next address field (0 in the example shown) means that the record is the last one of the chain. The bitmap segment file (right-hand parts of FIGS. 15 and 16) is addressed in the same manner as the associated rank file. In each record for layer k, it has a bitmap field of Dk bits for receiving the bitmap segment associated with the rank stored in the corresponding record of the rank file. It will be appreciated that the rank values and next record addresses could also be stored in two separate files having a common addressing rather than in two fields of the same file.

[0158] For each VDG coding layer k, an entry in a thesaurus has a head address field for containing an address in the layer k rank file where a first rank record concerning the entry is stored. From there, the relevant rank chain can be retrieved. For example, Max's layer 1 ranks 0, 2 and 3 (FIG. 13A) are retrieved by accessing the rank file of FIG. 15 at the address 29 indicated in the head address field of the thesaurus entry (FIG. 14A), and then at the chained addresses 27 and 15. In parallel, the corresponding layer 1 bitmap segments 001, 010 and 100 are read. FIGS. 15 and 16 also show that the rank and bitmap segment files have an additional chain consisting of free records (addresses 31/32/33/17 in FIG. 15 and 29/8/17/24 in FIG. 16). The head of the latter chain is allocated to write new coding data when necessary.
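The retrieval of a record chain from its head address may be sketched as follows in Python. The dictionary layout and the function name are assumptions of the sketch; the addresses, ranks and segments are those quoted above for “Max”, and only those records of the container are represented.

```python
# Layer 1 container: address -> (rank, next address, bitmap segment).
# A next address of 0 marks the last record of a chain.
layer1_container = {
    29: (0, 27, "001"),
    27: (2, 15, "010"),
    15: (3, 0, "100"),
}

def read_chain(container, head):
    """Follow the chain from the head address and collect
    the (rank, segment) pairs in chain order."""
    items = []
    address = head
    while address != 0:
        rank, next_address, segment = container[address]
        items.append((rank, segment))
        address = next_address
    return items

max_head = 29   # head address field of Max's thesaurus entry
print(read_chain(layer1_container, max_head))
```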

[0159] Preferably, the thesaurus entry further has a layer 1 tail address field for containing the address in the rank file of the last record of the chain pertaining to the entry, as shown in the third columns of FIGS. 14A, 14G and 14H. This facilitates the updating of the encoding data storage. For instance, the insertion of a new layer 1 rank for Max, with a corresponding layer 1 bitmap segment, may proceed as follows: the head of the free record chain is located (address 31); the address (32) found in its next record address field becomes the address of the new free record chain head; the record at address 31 receives the new layer 1 rank in the rank field, the end-of-chain flag (0) in the next address field and the new bitmap segment in the segment field, respectively; the address obtained in the tail address field of Max's thesaurus entry (15) is accessed directly (bypassing the potentially long path along the chain) to write the address (31) of the new data, which is also written into the tail address field of Max's thesaurus entry. The fact that the layer 1 rank is a new one for Max can be determined from the layer 2 data: if the layer 2 updating performed previously has changed a “0” to a “1” in the layer 2 bitmap segment, then the layer 1 rank is a new one for the word; otherwise the layer 1 rank is already present in Max's layer 1 rank list which has to be scanned until said layer 1 rank is found. If there are more than two encoding layers, it is possible to provide a layer k tail address field in the thesaurus entries for k>1 and to proceed in the same manner for new layer k ranks as determined from the layer k+1 data. However the main gain in doing so lies in layer 1 which has the longest chains.
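The insertion steps listed in this paragraph may be sketched as follows in Python. The record layout and the function name are assumptions, the chain is assumed to be non-empty, and the new rank and segment values (rank 1, segment 100) are hypothetical; the addresses are those quoted for FIG. 15.

```python
def append_record(container, entry, free_head, rank, segment):
    """Append a new (rank, segment) record to the chain of a thesaurus
    entry, using the tail address; return the new free chain head."""
    new_address = free_head
    next_free = container[new_address]["next"]   # new head of free chain
    container[new_address] = {"rank": rank, "next": 0, "segment": segment}
    container[entry["tail"]]["next"] = new_address  # link old tail to it
    entry["tail"] = new_address                     # update tail address
    return next_free

container = {
    29: {"rank": 0, "next": 27, "segment": "001"},   # Max's chain
    27: {"rank": 2, "next": 15, "segment": "010"},
    15: {"rank": 3, "next": 0, "segment": "100"},
    31: {"rank": 0, "next": 32, "segment": "000"},   # free record chain
    32: {"rank": 0, "next": 33, "segment": "000"},
    33: {"rank": 0, "next": 17, "segment": "000"},
    17: {"rank": 0, "next": 0, "segment": "000"},
}
max_entry = {"head": 29, "tail": 15}

free_head = append_record(container, max_entry, 31, rank=1, segment="100")
print(free_head, max_entry, container[15], container[31])
```

No record of the existing chain other than the tail is read or written, which is the point of keeping the tail address in the thesaurus entry.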

[0160] In FIGS. 15 and 16, the coding data coming from three heterogeneous thesauruses (client name thesaurus, accident amount word thesaurus and accident amount macroword thesaurus) are stored in the same data containers. The other thesauruses are ignored for clarity of the figures. In fact, all the coding data of one layer may be piled up in the same rank/bitmap segment files, irrespective of the word or macroword thesaurus where they come from. Any entry of any thesaurus will then point to a respective record chain in those two coupled files.

[0161] In order to optimize the processing speed, it is preferable to sort the rank and bitmap segment files for disc storage, so as to group the records based on the thesaurus entries to which they pertain. The advantage in doing so is that the reading of the coding data for one thesaurus entry requires fewer disc accesses, by means of the computer cache memory which enables the simultaneous RAM loading of a group of physically contiguous records. A batch execution of that optimization sorting, which requires a simultaneous update of the thesaurus entries (head and tail address fields), may be used to avoid untimely resource usage.

[0162] In order to facilitate this optimization, it is preferable to use separate data containers for different thesauruses, rather than common files. This reduces the amount of data to be sorted each time. In particular, using one rank/bitmap segment file pair for each thesaurus and each coding layer seems appropriate.

[0163] A further possibility is to provide separate rank and bitmap segment files for the different thesaurus entries. This requires a higher number of file declarations in the memory. But it is optimal in terms of processing speed without requiring the above-mentioned optimization sorting operation. It also eliminates the need for storing head and tail addresses pointing to record chains: the thesaurus entries simply designate data containers where the rank and bitmap segment data are stored.

[0164] FIG. 17 illustrates how the data of the client name thesaurus may be arranged in the latter case. The thesaurus has an index register where the thesaurus words are kept sorted. For each word and each coding layer k, two files are provided in the system memory, one for containing the rank data (noted NOk), and one for containing the bitmap segments (noted HPk). The attribute value (André, Ariane and so on) can be used to name the corresponding files. The storage is less compact than with common data containers as shown in FIGS. 15-16, but access to the data relating to one word can be very quick without requiring any sorting.

[0165] An arrangement as illustrated in FIG. 17 is preferred if the operating system does not suffer too severe limitations regarding the number of files that can be managed in the memory, and if the overhead due to the storage of numerous individual files is not a problem. Otherwise, it is possible to group the rank and bitmap segment files relating to different (macro)words, or even to different thesauruses, as indicated before.

[0166] In addition to enhanced data compression, the multi-layer row-ID list encoding method provides a substantial acceleration of most query processing. The processing is first performed in the higher layer, and the results are passed to the lower layers. The coding scheme preserves a common structure for the entries of all thesauruses in each layer, imprinted by the original structure imparted by the virtual flat file. Accordingly, collective logical operations between integer lists or bitmaps originating from different thesauruses are possible in the various layers. The results obtained in a layer k+1 provide a sort of filter for executing the minimum number of operations in layer k, which enhances the processing efficiency, particularly for multi-attribute query criteria.

[0167] This enhancement is hardly visible on our simplified example, which is too small. Consider the following request: find Max's accidents for an amount of 1,300 (there is no response). The direct layer 1 processing is to read and decode the relevant layer 1 data to rebuild the bitmap vectors of the words “Max” and “1,300” in the thesauruses of FIGS. 10A and 10G, and to compute the logical AND of the two bitmap vectors. Exactly the same kind of processing in layer 2 requires fewer read operations since there are fewer layer 2 records, and avoids any layer 1 processing because there is no overlap between the two layer 1 rank lists for the words “Max” and “1,300” (2nd column of FIGS. 13A and 13G). If the same request is made with the amount value 10,000 instead of 1,300, the layer 2 results may reduce the layer 1 processing to loading the two layer 1 bitmap segments corresponding to rank 0 (the other ranks are filtered out) and computing the AND between those segments.

[0168] With more representative values of D1 and D2 (e.g. D1=D2=128) and a large size database, this filtering principle between two layers provides a spectacular gain. Large pieces of bitmap vectors disappear from the layer 1 (or generally layer k≧1) processing owing to the groupwise filtering achieved in layer 2 (layer k+1).
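The filtering of layer 1 processing by layer 2 results may be sketched as follows in Python for the intersection of two coded lists with n=2. This is a minimal illustration under stated assumptions: function names are hypothetical, segments are held as integers whose bit r is set when position r is occupied, and the second lists used in the example are hypothetical, only {2, 7, 9} being inferred from the data quoted for “Max”.

```python
def encode_layer(integers, d):
    """Return {rank: segment}, segment being an int with bit r set."""
    segments = {}
    for n in integers:
        q, r = divmod(n, d)
        segments[q] = segments.get(q, 0) | (1 << r)
    return segments

def encode2(row_ids, d1, d2):
    layer1 = encode_layer(row_ids, d1)
    layer2 = encode_layer(layer1.keys(), d2)   # encode the layer 1 ranks
    return layer1, layer2

def intersect(a, b, d1, d2):
    """AND two coded lists, visiting layer 1 only where layer 2 allows."""
    (a1, a2), (b1, b2) = a, b
    rows = []
    for q2 in a2.keys() & b2.keys():           # common layer 2 ranks
        common2 = a2[q2] & b2[q2]              # AND of layer 2 segments
        for r2 in range(d2):
            if (common2 >> r2) & 1:            # surviving layer 1 rank
                q1 = q2 * d2 + r2
                common1 = a1[q1] & b1[q1]      # AND of layer 1 segments
                rows.extend(q1 * d1 + r1 for r1 in range(d1)
                            if (common1 >> r1) & 1)
    return sorted(rows)

D1, D2 = 3, 2
coded_max = encode2({2, 7, 9}, D1, D2)
print(intersect(coded_max, encode2({0, 2, 9, 11}, D1, D2), D1, D2))
print(intersect(coded_max, encode2({3}, D1, D2), D1, D2))
```

In the second call the layer 2 AND is all zero, so no layer 1 segment is read at all, which mirrors the “Max” and “1,300” case discussed above.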

VDG Creation and Management

[0169] FIG. 18 shows an exemplary layout of a computer system suitable for forming the hardware platform of a system in accordance with the invention. That hardware platform may be of conventional type. It has a bus 100 for exchanging digital signals between a plurality of units including:

[0170] a central processing unit (CPU) 101;

[0171] a read only memory (ROM) 102 for containing basic operating instructions of the CPU;

[0172] a random access memory (RAM) 103 which provides a working space for the CPU 101, dynamically containing program instructions and variables handled by the CPU;

[0173] a man-machine interface 104 which comprises circuitry for controlling one or more display devices (or other kind of devices for delivering information to humans) and circuitry for inputting information to the computer system from acquisition devices such as a keyboard, mouse, digital pen, tactile screen, audio interface, etc.;

[0174] a mass storage device for storing data and computer programs to be loaded into RAM 103. In the typical example shown in FIG. 18, the mass storage device comprises a hard drive 105 for storing data on a set of magnetic discs 106. It will be appreciated that any kind of mass storage device, magnetic or optical, may be used in implementing the invention.

[0175] For implementing the present invention, the hard drive unit 105 is used for storing data structures as described in the foregoing and programs described in more detail herebelow. The program instructions and the useful data are loaded into the dynamic storage RAM 103 for processing by CPU 101. The query results are stored in the hard drive and/or delivered to a user through the man-machine interface 104 or through a network interface (not shown) in the case of a remote access.

[0176] The mass storage device 105 is suitable for the storage of large amounts of data, but with an access time significantly longer than the RAM 103. This is due to the time needed to put the reading head of the hard drive in front of the desired disc location. As well-known in the art, when a disc access is performed in hard drive 105, the data that are actually read form a block of data stored contiguously on the hard disc, which is loaded in a portion of RAM 103, called “cache” memory. When it is known that the CPU is likely to need different data pieces simultaneously or in a short period of time, it is convenient to arrange the data storage organization such that those data belong to the same block so as to be retrievable by a single disc access, which minimizes the processing time.

[0177] The system of FIG. 18 may be a personal computer (PC) of the desktop or laptop type. It may also be a workstation or a mainframe computer.

[0178] Of course, other hardware platforms may be used for implementing the invention. In particular, those skilled in the art will appreciate that many calculations performed on the bitmap segments and vectors lend themselves to efficient implementation by means of dedicated logical circuits or coprocessors. Furthermore, parallel computation is very natural in this system.

[0179] The process of creating the VDG data structure is now described with reference to FIG. 19 from input data tables being in the form shown in FIGS. 1-3, which is the most usual data representation. That creation process is thus suitable for creating the VDG structure from legacy databases. From the VDG updating rules described further on, it will be understood that VDG's may also be created directly from brand new data.

[0180] In certain databases, the data tables have their rows characterized by compound keys rather than row-ID's as in FIGS. 1-3. A compound key is the concatenation of the contents of several key fields of a data table. In a source data table, the records include foreign keys which designate the compound keys of records of a target table. If such a legacy database is handled, the first stage of the VDG creation procedure is to translate the compound keys into single keys such as the row-ID's shown in FIGS. 1-3. This (optional) first stage is illustrated in box 120 in FIG. 19.

[0181] The second stage 121 consists in completing the data tables with Null records where appropriate. This is performed as discussed hereabove with reference to FIGS. 4-7, by scanning every path in the data table tree from the leaf table of the path to the root table. A link to a Null record is denoted by the default value −1. As a result, for each source/target table pair, all the row-IDs of the target table are present at least once in the source table link column.

[0182] The next stage 122 comprises the creation of the word thesauruses. The relevant attributes, i.e. those likely to be used in query criteria (it may be all of them), are determined. For each of the determined attributes, the word format (type and length) is selected. For each word thesaurus, the attribute values occurring in the associated column, including the Null value, are read from the data table stored in the hard drive 105. Repeated values are eliminated, and the remaining values are sorted based on the attribute values and the order relationship applicable to the type of attribute. This sorting operation may be performed in successive data record blocks transferred from the hard drive 105 to the CPU cache memory, with an external sorting after processing each block.
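The elimination of repeated values and the sorting of stage 122 may be sketched as follows in Python for a column small enough to be held in memory (the block-wise external sorting is not shown). Representing the Null value by `None` and placing it first in the sorted thesaurus are assumptions of the sketch, as are the function name and the value “house”.

```python
def sorted_words(column):
    """Return the distinct attribute values of a column in sorted order,
    with the Null value (None) placed before all other words."""
    distinct = set(column)                       # eliminate repeated values
    return sorted(distinct, key=lambda w: (w is not None, w))

# Hypothetical policy type column of a data table.
policy_type = ["car", "house", None, "car", "car", "house"]
print(sorted_words(policy_type))
```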

[0183] The VDG creation procedure then proceeds to a stage 123 of deciding the relevant macroword formats. Some word thesauruses will not give rise to macroword thesauruses (for example, the client gender thesaurus of FIG. 10C needs no macrowords). In contrast, other thesauruses, for example corresponding to date or amount attributes, will give rise to several macroword thesauruses having different truncation lengths. If the values found in an attribute column include character strings beginning with most letters of the alphabet, it is convenient to create a macroword thesaurus with a prefix length of one character. The decision about the suitable macroword hierarchy may be made by a database manager and input through the man-machine interface 104. It may also be an automatic process, based on the attribute type and/or the distribution of the words in the thesaurus. In stage 123, the macroword thesauruses are also created, directly in sorted form, by applying the truncation to the words of the corresponding word thesauruses and deleting the repeated macrowords.

[0184] Each entry of a macroword thesaurus preferably indicates the first word (or lower level macroword) of the lower level thesaurus included in the range covered by the macroword. This indication of the lowest word (or macroword) whose prefix matches the macroword under consideration reduces the time needed to access the “children” of that macroword since the first one can be accessed without scanning the lower level thesaurus. Alternatively, or cumulatively, the highest word (or lower level macroword) whose prefix matches the macroword could be indicated in the macroword thesaurus.
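One possible way of determining, when the macroword thesaurus is created, the index of the first child of a macroword in the sorted lower level thesaurus is sketched below in Python using a binary search. The function name and the word list are assumptions; the value so obtained would then be stored in the macroword entry so that no search is needed at query time.

```python
from bisect import bisect_left

def first_child(sorted_words, macroword):
    """Return the index of the lowest word beginning with the macroword
    in the sorted lower level thesaurus, or None if there is none."""
    i = bisect_left(sorted_words, macroword)
    if i < len(sorted_words) and sorted_words[i].startswith(macroword):
        return i
    return None

# Illustrative daily word thesaurus (YYYYMMDD words, sorted).
words = ["19981003", "19990612", "19991208", "19991209"]
print(first_child(words, "1999"), first_child(words, "199912"))
```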

[0185] In stage 124, the rows of the link table and the entries of the individual word thesauruses are generated. This is preferably done without storing the whole flat file (FIG. 8), for example according to the algorithm illustrated in FIG. 20, in the case of an encoding with n=2 layers.

[0186] In the embodiments illustrated in FIGS. 20-32, it is assumed that each entry of a thesaurus for an attribute value contains an index WI which forms a row-ID in an auxiliary table of the type shown in FIG. 14A, 14G or 14H, pointing to coding data containers of the type shown in FIGS. 15 and 16. For each encoding layer k, this auxiliary table has:

[0187] a column for containing the address, noted AT_Fk(WI), of a first record concerning the thesaurus word of index WI in the coding data container relating to layer k;

[0188] a column for containing the address, noted AT_Lk(WI), of the last record of the chain for thesaurus word of index WI in the data container; as indicated before, the latter column may be present only for layer 1.

[0189] As mentioned previously, the data container for a given coding layer may be shared between all or part of the thesauruses, or it may be associated with each individual thesaurus. A record at address AD (≧1) in the layer k container (here assumed to be common to all thesauruses) comprises a first field NOk(AD) for containing the rank data as an integer ranging from 0 to Dk−1, a second field for containing the address NXk(AD) of the next record of the chain (this address is 0 if there is no further address), and a third field for containing the corresponding bitmap segment HPk(AD). The layer k container has a free record chain whose first record address is noted Hk.

[0190] It is noted that the auxiliary table could also be shared by several thesauruses containing distinct word indexes to access such common auxiliary table.

[0191] Before stage 124, all the records of the data container are chained together and free, and the bitmap segments HPk(AD) are initialized with all zero segments. The columns AT_Fk and AT_Lk of all the auxiliary tables are also initialized with the value 0.

[0192] The quotient and the remainder of the Euclidean division of a flat file row-ID by D1 are respectively noted Q1 and R1. For each further layer k>1, Qk and Rk respectively denote the quotient and remainder of the Euclidean division of Q(k−1) by Dk. At the initialization step 130 of FIG. 20, the integers Q1, R1, Q2 and R2 are set to 0.

[0193] The rows of the root table(s), which may be read one by one or block by block from the hard drive 105, are selected one by one in step 131. The records of the other data tables which are linked with the selected root table row are read in step 132. This provides a data graph of the type illustrated in compact form in FIGS. 5-7.

[0194] The links of those data graphs, i.e. the row-ID's in the data tables, are written into the relevant columns of the link table (FIG. 9) at row-ID Q1×D1+R1 (step 133). If there is no link table, step 133 is skipped.

[0195] For the current data graph, the different attributes AT are successively selected (step 134). The value of the selected attribute AT is located by means of a dichotomic search in the corresponding thesaurus, and its word index WI is read in step 135. Step 136, which will be detailed hereafter with reference to FIGS. 21-24, consists in updating the auxiliary table and data containers with respect to the AT thesaurus entry for the word index WI. This updating corresponds to the insertion of the current flat file row-ID Q1×D1+R1 into the integer list relating to the thesaurus word index WI.

[0196] When all the attributes have been thus handled (test 137), the layer 1 remainder index R1 is incremented by one unit in step 138. If the incremented R1 is equal to D1 (test 139), then the index R1 is reset to 0, and the layer 1 quotient index Q1 and layer 2 remainder index R2 are each incremented by one unit in step 140. If the incremented R2 is equal to D2 (test 141), then the index R2 is reset to 0, and the layer 2 quotient Q2 is incremented by one unit in step 142. After step 142, or when R1<D1 in test 139 or R2<D2 in test 141, a test 143 is performed to determine whether all the rows of all the root tables have been considered. If not, the procedure comes back to step 131 to select a new root table row.

[0197] Once all the root table rows have been considered, stage 124 of FIG. 19 is over, and the parameters Q1, R1, Q2 and R2 are memorized for subsequent insertion of possible new data records. At that point, the number of rows in the virtual flat file is given by Q1×D1+R1.

[0198] Clearly, the procedure of FIG. 20 is readily extended to n>2 encoding layers, by initializing all Qk and Rk parameters to 0 in step 130 and by developing steps 138-142 (which are equivalent to incrementing the data graph pointer Q1×D1+R1) in the higher layers.
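
As a hedged illustration of this extension, the carry propagation of steps 138-142 can be written for an arbitrary number n of layers. The sketch below is in Python; the function name advance and the use of 0-based lists (Q[k-1] holding Qk, and so on) are assumptions of the sketch, not part of the described embodiment.

```python
def advance(Q, R, D):
    """Increment the data graph pointer Q1*D1+R1 by one, in place.

    Q, R and D are lists of length n; index k-1 holds Qk, Rk and Dk.
    Mirrors steps 138-142 of FIG. 20, repeated for each higher layer.
    """
    for k in range(len(D)):
        R[k] += 1              # increment the layer remainder (step 138 / 140)
        if R[k] < D[k]:
            return             # no carry: the higher layers are unchanged
        R[k] = 0               # remainder wrapped around (tests 139 / 141)
        Q[k] += 1              # so the layer quotient grows (steps 140 / 142)
        # ...and the loop goes on to increment the next layer's remainder.


if __name__ == "__main__":
    D = [4, 3]                 # hypothetical divisors D1, D2
    Q, R = [0, 0], [0, 0]      # initialization step 130
    for _ in range(13):
        advance(Q, R, D)
    print(Q, R)                # counters after 13 data graphs
```

After each call, Q[0]*D[0]+R[0] equals the number of rows processed so far, and each higher pair (Qk, Rk) remains the quotient and remainder of Q(k−1) divided by Dk.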

[0199] FIG. 21 shows how the program can manage the record chains in the data container and the thesaurus auxiliary table in layer k≧1 for a word index WI in the thesaurus relating to an attribute AT. The first step 150 is to load the value AT_Fk(WI) stored in the auxiliary table into the address variable AD. If AD=0 (test 151), then a record chain has to be initialized for thesaurus index WI, so that the head address Hk of the free record chain in the data container is assigned to AT_Fk(WI) in step 152.

[0200] If there was already a record chain for the thesaurus index WI (AD>0 at test 151), the rank NOk(AD) is loaded into the rank variable q in step 153. If the following test 154 shows that q is different from the quotient variable Qk, the address variable AD′ receives the address of the next record of the chain, i.e. NXk(AD), in step 155. If AD′ is still different from 0 (test 156), the process comes back to step 153 for examining the next rank variable of the record chain, after substituting AD′ for AD in step 157. When AD′=0 in test 156, a data container record has to be appended to the chain for thesaurus index WI, so that the head address Hk of the free record chain is written into the next record field NXk(AD) in step 158.

[0201] After step 152 or 158, the head address Hk of the free record chain is loaded into the address variable AD in step 159. Step 160 is then executed to update the auxiliary table and data container. This update operation 160 consists in:

[0202] replacing the head address Hk by the next address NXk(AD) of the free chain;

[0203] writing the current value of the address variable AD into AT_Lk(WI); and

[0204] writing Qk and 0, respectively, in the fields NOk(AD) and NXk(AD) of the data container.

[0205] After step 160, or when q=Qk in the above-mentioned test 154, the bitmap segment HPk(AD) is updated in step 161 by writing the digit “1” at bit position Rk of that segment.
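
The chain management of FIG. 21 for one layer k may be easier to follow as code. The Python sketch below is a minimal illustration under stated assumptions: the class name Layer, the in-memory lists standing for the container columns, the dictionaries standing for the auxiliary table columns, and the convention that bit position Rk is bit Rk of a Python integer are all choices made for the sketch.

```python
class Layer:
    """One coding layer: data container columns plus auxiliary table columns."""

    def __init__(self, size):
        # Address 0 is reserved to mean "no record".
        self.NO = [0] * (size + 1)                        # rank field NOk(AD)
        self.NX = [0] + list(range(2, size + 1)) + [0]    # all records chained and free
        self.HP = [0] * (size + 1)                        # bitmap segment HPk(AD)
        self.H = 1                                        # head of the free chain, Hk
        self.AT_F, self.AT_L = {}, {}                     # auxiliary table (0 if absent)

    def insert(self, WI, Qk, Rk):
        AD = self.AT_F.get(WI, 0)          # step 150
        found = False
        if AD == 0:                        # test 151: no chain yet
            self.AT_F[WI] = self.H         # step 152
        else:
            while True:
                if self.NO[AD] == Qk:      # steps 153-154: rank already present
                    found = True
                    break
                if self.NX[AD] == 0:       # steps 155-156: end of chain reached
                    self.NX[AD] = self.H   # step 158: append a free record
                    break
                AD = self.NX[AD]           # step 157
        if not found:
            AD = self.H                    # step 159
            if AD == 0:
                raise MemoryError("data container full")
            self.H = self.NX[AD]           # step 160: unlink from the free chain
            self.AT_L[WI] = AD
            self.NO[AD], self.NX[AD] = Qk, 0
        self.HP[AD] |= 1 << Rk             # step 161

    def rows(self, WI, Dk):
        """Decode the integer list of word index WI (helper for illustration)."""
        out, AD = [], self.AT_F.get(WI, 0)
        while AD:
            out += [self.NO[AD] * Dk + b for b in range(Dk) if (self.HP[AD] >> b) & 1]
            AD = self.NX[AD]
        return sorted(out)


if __name__ == "__main__":
    layer, D1 = Layer(8), 4                # hypothetical container size and D1
    for row in (1, 2, 9, 5):
        layer.insert("x", *divmod(row, D1))
    print(layer.rows("x", D1))
```

The helper rows is not part of FIG. 21; it is included only so that the effect of the insertions can be checked by decoding the chain.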

[0206] In FIG. 20, it has been considered that both the layer 1 and layer 2 coding data are updated in step 136. This means that the procedure of FIG. 21 is executed once for k=1 and once for k=2. Another possibility is to execute it only for k=1, and to generate the layer 2 coding data subsequently, by processing the layer 1 rank data produced in stage 124.

[0207] It is worth noting that when initializing the VDG's from a legacy database as in FIG. 20, the rank data Qk appear in an increasing order (we always have q≦Qk in test 154 of FIG. 21). Accordingly, it is possible to move directly to the record chain tail, i.e. to take AD=AT_Lk(WI) instead of AD=AT_Fk(WI) in step 150. In this case, step 158 is executed directly when Qk>q in test 154, thereby avoiding the scanning of the record chain. Such embodiment is illustrated in FIG. 22.

[0208] In the latter embodiment, once the VDG initialization is over, the layer k tail address fields AT_Lk with k>1 may be discarded. However, if the VDG management is such that any new VDG likely to be inserted has a flat file row-ID equal to or greater than all the flat file row-ID's of the existing VDG's (i.e. the flat file row of any deleted VDG will not be used any more), then it is advantageous to keep all the tail address fields AT_Lk in order to perform any subsequent update in accordance with the embodiment of FIG. 22.

[0209] In the form depicted in FIG. 21, the update procedure is applicable independently of any hypothesis on the rank values Qk.

[0210] FIGS. 23 and 24 show an alternative method of updating the auxiliary table and data containers with respect to the AT thesaurus entry for the word index WI in step 136, which takes advantage of the tail address field AT_L1 of the auxiliary table in layer 1 (with n=2 coding layers). FIG. 23 illustrates the layer 2 processing which is performed before the layer 1 processing of FIG. 24. Most of the steps of FIGS. 23-24 are very similar to steps of FIG. 21, so that corresponding reference numerals have been used.

[0211] The layer 2 processing of FIG. 23 is essentially the same as that of FIG. 21 (k=2), with the following differences:

[0212] it is not necessary to deal with tail address fields AT_L2(WI) in step 160;

[0213] step 161 further includes setting to “1” the binary variable LL1, which means that the current layer 1 rank data Q1 does not belong to the layer 1 record chain relating to the word index WI;

[0214] when q=Q2 in test 154, another test 164 is made to determine whether the bit position R2 of the layer 2 segment HP2(AD) contains the value “1”; step 161 follows only if that test 164 is negative;

[0215] if test 164 is positive, the current layer 1 rank data Q1 already belongs to the layer 1 record chain relating to the word index WI, so that the variable LL1 is set to “0” in step 165.

[0216] The layer 1 processing of FIG. 24 begins at step 170 by testing whether LL1 is 0 or 1. If LL1=0, step 150 is executed to load the value AT_F1(WI) stored in the layer 1 auxiliary table into the address variable AD, and a loop 153-155 is executed to find the data container address AD where the data relating to the rank Q1 are stored. Steps 153 and 154 are the same as in FIG. 21, and in step 155 the next address NX1(AD) is directly loaded into the address variable AD (AD is never 0 because LL1=0). The program proceeds to step 161 when q=Q1 in test 154.

[0217] If LL1=1 in test 170, step 171 is executed to load the value AT_L1(WI) stored in the layer 1 auxiliary table into the address variable AD. If AD=0 (test 172), the sequence of steps 152, 159-161 is executed as in FIG. 21 (however, it is not necessary to deal with next address fields NX1(AD) in step 160). If AD>0 in test 172, the sequence of steps 158-161 is executed as in FIG. 21.

[0218] The procedure of FIGS. 23-24 avoids the scanning of the layer 1 record chains when the rank data Q1 are not in such chains, without any hypothesis on the rank values.

[0219] After all the coding data for the individual word thesauruses have been generated, the next stage 125 of the procedure shown in FIG. 19 is to rearrange the stored coding data. As indicated previously, this is done to organize the record chains in the coding data container of each layer so that records pertaining to the same thesaurus word have contiguous addresses in order to be accessible in one or few disc accesses by means of the CPU cache memory. A simple way to do this is to reserve memory space for a new auxiliary table and new coding data containers. The thesaurus words are considered one by one, and for each of them, the coding data pointed to in the old auxiliary table are read sequentially and copied into the new data container at an address AD incremented after each write operation. When proceeding to the next thesaurus word index WI+1, new pointers AT_Lk(WI)=AD−1 and AT_Fk(WI+1)=AD are determined and stored into the new auxiliary table. After all the coding data records have been thus read and rewritten into the new data container, the old data container and auxiliary table are discarded.
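
The copy-based rearrangement just described might look as follows for one layer. This Python sketch assumes, purely for illustration, that the container columns are lists indexed by address (address 0 unused) and that the auxiliary table column AT_Fk is a dictionary keyed by word index; the function name rearrange is hypothetical, and the free record chain of the new container is not rebuilt here.

```python
def rearrange(AT_F, NO, NX, HP):
    """Copy every record chain to contiguous addresses of a new container.

    Returns the new auxiliary columns (AT_F, AT_L) and the new container
    columns (NO, NX, HP). Address 0 keeps its meaning of "no record".
    """
    new_F, new_L = {}, {}
    new_NO, new_NX, new_HP = [0], [0], [0]
    for WI in sorted(AT_F):                  # thesaurus words, one by one
        old = AT_F[WI]
        if old == 0:                         # word without any record
            new_F[WI] = new_L[WI] = 0
            continue
        new_F[WI] = len(new_NO)              # first new address of the chain
        while old:
            new_NO.append(NO[old])
            new_HP.append(HP[old])
            new_NX.append(len(new_NO))       # tentatively: the next address
            old = NX[old]                    # follow the old chain
        new_NX[-1] = 0                       # last record of the chain
        new_L[WI] = len(new_NO) - 1
    return new_F, new_L, new_NO, new_NX, new_HP


if __name__ == "__main__":
    # Hypothetical scattered container: word 0 uses records 3 then 1, word 1 uses record 2.
    print(rearrange({0: 3, 1: 2}, [0, 5, 7, 2], [0, 0, 0, 1], [0, 1, 2, 4]))
```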

[0220] Such rearrangement can be performed separately for each coding layer k.

[0221] If there are several data containers for different thesauruses in a coding layer, they may also be reordered separately.

[0222] As indicated before, the rearrangement step 125 is dispensed with when the thesauruses are organized in the manner illustrated by FIG. 17, since the coding data files naturally fulfil the grouping condition with respect to the thesaurus words.

[0223] In the following stage 126 of the procedure shown in FIG. 19, the macroword thesaurus entries are generated. For each macroword and each layer, this is done simply by merging the rank coding data Q1, Q2 of the words (or lower level macrowords) covered by the macroword, and by obtaining the corresponding bitmap segments by a logical OR of those relating to the words (or lower level macrowords). If the coding data have been rearranged for the word thesauruses as indicated in stage 125, the same grouping of the coding data will automatically be achieved for the macroword thesauruses.
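
For one layer, this merge can be sketched in a few lines of Python. The representation of the coding data of a word as a dictionary mapping each rank to its bitmap segment (held as an integer) is an assumption of the sketch, as is the function name merge_coding.

```python
def merge_coding(children):
    """Merge the layer-k coding data of the words covered by a macroword.

    children: iterable of {rank: bitmap_segment} dictionaries, one per word
    (or lower level macroword). Ranks are merged and the segments sharing a
    rank are combined with a logical OR.
    """
    merged = {}
    for coding in children:
        for rank, segment in coding.items():
            merged[rank] = merged.get(rank, 0) | segment
    return dict(sorted(merged.items()))      # keep ranks in ascending order


if __name__ == "__main__":
    word_a = {0: 0b0011, 2: 0b1000}          # hypothetical coding data
    word_b = {0: 0b0100, 1: 0b0001}
    print(merge_coding([word_a, word_b]))
```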

[0224] In stage 127, the now useless link columns of the original data tables (FIGS. 1-3) can be deleted. The Null records which have been added in stage 121 can also be deleted, their occurrence being indicated by the default value −1 in the link table (FIG. 9).

[0225] Finally, the elements to be stored in the hard drive 105 in the above-described embodiment are:

[0226] the data tables as illustrated in FIGS. 1-3, without the link columns. Parameters defining the data table tree structure of FIG. 4 are stored in association with the tables;

[0227] the link table as illustrated in FIG. 9;

[0228] the sorted thesauruses comprising an index register and an auxiliary table for each desired attribute. FIGS. 25-26 show the index registers for the attributes AT=CN (“client name”) and AT=AA (“accident amount”) in our simplified example. FIGS. 28-29 show the corresponding auxiliary tables;

[0229] the macroword thesauruses organized like the individual word thesauruses, with a specified truncation or prefix length. The index register of each macroword thesaurus further has an additional column containing, for each macroword, the row-ID, in the index register of the thesaurus of lower level for the same attribute, of the first word (or macroword) covered by the macroword. FIGS. 27 and 30 show the index register and auxiliary table for the attribute AT=CN and the truncation length 3;

[0230] the coding data container(s) for each coding layer, each having a variable head address for its free record chain. FIGS. 31 and 32 show layer 1 and layer 2 data containers shared by the thesauruses of FIGS. 24-29 (free record chain head addresses 30 and 27, respectively);

[0231] optionally, one or more thesauruses stored in a “low density” format suitable for attributes of high cardinality. In the low density format, n=0 and the flat file row-ID's are stored as explicit (short) integer lists, for example by means of record chains. If the coding data for layers 1 through n are needed, they are easily calculated by performing n successive Euclidean divisions from each stored integer of the list. For a given high cardinality attribute, it may be appropriate to provide an individual word thesaurus in the low density format and one or more macroword thesauruses in the “normal” encoded format.
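
The derivation of layered coding data from a low density list, mentioned in the last item, can be illustrated as follows. The Python sketch assumes that the coding data of each layer are held as a dictionary from rank to bitmap segment (an integer) and that the divisors D1, ..., Dn are supplied as a list; the function name encode_layers is hypothetical.

```python
def encode_layers(row_ids, D):
    """Compute the coding data of layers 1..n from an explicit integer list.

    Returns one {rank: bitmap_segment} dictionary per layer. The ranks
    produced in layer k are the integers encoded in layer k+1.
    """
    layers = []
    current = sorted(set(row_ids))
    for Dk in D:
        coding = {}
        for x in current:
            q, r = divmod(x, Dk)             # Euclidean division by Dk
            coding[q] = coding.get(q, 0) | (1 << r)
        layers.append(coding)
        current = sorted(coding)             # ranks feed the next layer
    return layers


if __name__ == "__main__":
    print(encode_layers([1, 2, 9, 5], [4, 3]))   # hypothetical list and divisors
```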

[0232] The data containers of FIGS. 31 and 32 are derived from those shown in FIGS. 15 and 16 pursuant to the rearrangement stage 125, in which the auxiliary tables of FIGS. 28-30 are also obtained from those of FIGS. 14A, 14G and 14H, respectively. For treating a query concerning the client called André, the processor would have to read records #20, #11 and #2 of FIG. 15 (limiting ourselves to layer 1) if the coding data container were not rearranged, whereas it reads the physically contiguous records #1, #2 and #3 of the rearranged container of FIG. 31. The latter reading can typically be done by loading a single block into the cache memory. More disc accesses, and hence a longer processing time, are required for reading scattered records.

[0233] The columns separated by broken lines in FIGS. 9 and 28-32 are preferably stored separately. For example, the storage address of one value in such a column may be defined as a start address assigned to the column plus an offset equal to its row-ID multiplied by a field length.

[0234] Accordingly, the links of a link table row (data graph) are stored at corresponding addresses given by the flat file row-ID. This separate storage of the link columns accelerates the data retrieval when some of the data tables need not be accessed to produce the output data requested in the query.

[0235] Likewise, some elementary operations performed in the query processing require only coding data for one layer, so that it is advantageous to separate the information concerning each layer in the auxiliary tables to accelerate the processing. Other operations involve the rank data and the bitmap segment data independently, so that it is advantageous to separate those data in the data containers as shown in FIGS. 31 and 32.

[0236] In an alternative way of storing a thesaurus, the word index register and the auxiliary table are merged in a single table with a Huffman type of indexing: each row of that table contains a value of attribute AT, the AT_Fk and AT_Lk data, a next row pointer (the next row contains the next value of the attribute in the sorted thesaurus) and optionally a preceding row pointer.

[0237] In an embodiment, the maintenance of VDG's created as described hereabove may involve the following operations:

[0238] 1/ Record Insertion

[0239] A new virtual data graph, i.e. a new row in the flat file, is generally generated in response to the insertion of a new record in a data table.

[0240] However, if the new record has a link to an existing record of another target table such that no other link points to said existing record, then there is no need for a new data graph, only for the update of an existing data graph. For example, if client Oscar subscribes to a first policy, e.g. for his car, a new record is added to the policy data table without creating any new VDG: the data graph of FIG. 7 is simply modified to place the new data in the node corresponding to the policy table. If Oscar then subscribes to a second policy, e.g. for his house, a new VDG will be necessary.

[0241] To generate the new VDG, all records from the other data tables, related to the new inserted record, including Null records, are identified by their respective row-ID's which, if necessary, can be retrieved by queries based on attribute values of those related records.

[0242] After appending the new record to the data table, the first thing to do is to initialize any new thesaurus entry which may be necessary if new attribute values occur (all AT_Fk and AT_Lk fields are initialized to 0). The new virtual flat file row and its corresponding thesaurus entries may be generated as in steps 133-142 of FIG. 20. Any higher level macroword thesaurus is updated accordingly.

[0243] 2/ Record Attribute Modification

[0244] Changing or adding an attribute value in an already existing data table record has no effect on the link table, which reflects the link structure rather than the table contents. Adding is a particular case of changing when the preceding attribute value was Null. Likewise, deleting an attribute value from a record is a particular case of changing when the new attribute value is Null.

[0245] If the new attribute value requires a new thesaurus entry, such entry is initialized (AT_Fk=AT_Lk=0). The list L of the link table row-ID's corresponding to flat file records comprising the data record to be amended is obtained by placing a suitable query. The latter list L is merged (bitmap OR operation) with the flat file row-ID list L′ of the new attribute value, and the coding data of the merged list L ∪ L′ are assigned to the new attribute value. The complement L̄ of list L is also determined (bitmap NOT operation) to be intersected (ANDed) with the flat file row-ID list L″ of the preceding attribute value. If the resulting intersection list L̄ ∩ L″ is not empty, its coding data are assigned to the preceding attribute value. This may transfer to the free record chain of one or more data containers records that previously belonged to the record chain associated with the preceding attribute value. If the intersection list L̄ ∩ L″ is empty, the preceding attribute value may be deleted from its word thesaurus. The same intersection and update sequence is performed for any higher level macroword thesaurus.
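
The two Boolean operations of this update (an OR with the list of the new value, and an AND of the complement with the list of the preceding value) can be illustrated on flat bitmaps. In the Python sketch below, each list is held as an integer whose bit i is set when flat file row-ID i belongs to the list; this flat representation, the function name move_rows and the explicit row count needed to bound the complement are assumptions of the sketch, which ignores the layered coding.

```python
def move_rows(L, new_list, old_list, n_rows):
    """Move the rows of list L from the preceding value to the new value.

    L, new_list and old_list are bitmaps held as integers; n_rows bounds
    the complement so that the NOT operation stays within the flat file.
    Returns the updated (new_list, old_list) pair.
    """
    mask = (1 << n_rows) - 1
    new_list = new_list | L                  # merged list: L OR L'
    old_list = (~L & mask) & old_list        # complement of L AND L''
    return new_list, old_list


if __name__ == "__main__":
    L = 0b0110                               # rows 1 and 2 are being amended
    new_rows, old_rows = move_rows(L, 0b1000, 0b0111, n_rows=4)
    print(bin(new_rows), bin(old_rows))
    if old_rows == 0:
        print("the preceding value may be deleted from its thesaurus")
```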

[0246] 3/ Record Link Modification

[0247] Changing a link in a source data table leads to corresponding changes in every occurrence of the link in the link table. The list L of the concerned link table rows can be determined by processing a suitable query.

[0248] If the target table record pointed to by the former link has no more link pointing thereto (its row-ID does not occur any more in the corresponding column of the link table after the modification), a new VDG is generated. Downstream of the modified link, this new VDG has the same content as the one(s) which is (are) being amended. Upstream of the modified link, it consists of Null records. The new virtual flat file row and its corresponding thesaurus entries may be generated as in steps 133-142 of FIG. 20. Any higher level macroword thesaurus is updated accordingly.

[0249] After that, a procedure similar to the one described in the preceding section can be performed for each attribute of the target table: /a/ the list L is merged with the flat file row-ID list L′ of the new attribute value (the value occurring in the target table record pointed to by the new link); /b/ the coding data of the merged list L ∪ L′ are assigned to the new attribute value; /c/ the complement L̄ of list L is intersected with the flat file row-ID list L″ of the preceding attribute value (the value occurring in the target table record pointed to by the former link); /d/ the coding data of the resulting intersection list L̄ ∩ L″ are assigned to the preceding attribute value; and /e/ the same intersection and update sequence is performed for any higher level macroword thesaurus.

[0250] If the first target table (for the modified link) has a link column to a second target table, the link value stored in the column of the link table associated with the second target table and in each row of list L is also changed, and the above procedure /a/-/e/ is performed for each attribute of the second target table. This is repeated for any data table located downstream of the first target table in the data table tree (FIG. 4).

[0251] For example, if a correction is made in the accident table of FIG. 3 to indicate that accident #6 was under policy #2 instead of policy #0, i.e. concerned Max's house instead of Ariane's car, the link from the accident table in the data graph of FIG. 5 has to be changed to point to policy record #2, and the link from the policy table has to be changed to point to client record #4. A new row is inserted in the virtual flat file, to contain the useful information about Ariane's car policy under which no accident took place. As a result, row #6 of the link table of FIG. 9 is changed to include the values 4, 2 and 6, respectively, in the client, policy and accident link columns, and a new row #12 is added including the values 2, 4 and −1, respectively, with corresponding changes in the thesauruses.

[0252] 4/ Record Cancellation

[0253] Canceling a record from a root table involves deleting the row(s) of the flat file containing that record. The corresponding flat file row-ID's are removed from the lists encoded in the thesauruses, i.e. zeroes are written at the associated locations of the bitmap vectors. These flat file row-ID's may be made available for further VDG insertion, for example pursuant to section 1/ or 3/ hereabove. They may also remain as blank rows if the virtual flat file size is not a major concern. Likewise, canceling a record from a target table which has no link pointing thereto in the corresponding source table involves deleting the row(s) of the flat file containing that record (these rows were representing data graphs with Null records upstream of the cancelled record).

[0254] If the cancelled record belongs to a target table for a compulsory link (e.g. the client or policy table in our example), any flat file row containing that record is also deleted. If the cancelled record belongs to a target table for an optional link (e.g. the third party or broker table in the example of FIG. 4), the cancellation comprises a link modification whereby any link pointing to that record is replaced by a link to a Null record (link value=−1). Such modification may be performed as described in the above section 3/ (but without generating any new VDG).

[0255] For any link of the cancelled record which pointed to a non-Null target table record whose row-ID does not occur any more in the corresponding column of the link table, it is necessary to generate a new VDG containing the same data as the cancelled record in and downstream of said non-Null target table record and Null values in and upstream of the cancelled record. The new virtual flat file row and its corresponding thesaurus entries may be generated as in steps 133-142 of FIG. 20. Any higher level macroword thesaurus is updated accordingly.

[0256] 5/ Thesaurus Update and Sorting

[0257] With the above-described structure of the thesaurus entries, the cancellation of a word in a thesaurus, which occurs when its flat file row-ID list becomes empty, could be done by leaving the thesaurus entry with zeroes in its HPk data. However, this is not optimal regarding memory usage.

[0258] A more efficient method is to update the record chains in the data container, so that the auxiliary table has AT_Fk(WI)=AT_Lk(WI)=0 for the entry WI of the cancelled word. In such a case, the word index WI can be released, a default value (e.g. −1) being written into the word index column for the cancelled word in the thesaurus index register.

[0259] The creation of a new word thesaurus entry can be done as illustrated in FIGS. 21-24 (AD=0 in test 151 or 172). The word index WI is obtained by incrementing a counter representing the number of thesaurus entries, or by selecting an available word index (e.g. which has been released previously when canceling another word). In this process, a (useful) row is added to the auxiliary table of the corresponding attribute, with row-ID=WI.

[0260] Similar procedures can be applied for updating the macroword thesauruses. A macroword index WI may be released when canceling a macroword (all its constituent words have been cancelled). In the case of a word creation, it is first checked whether the macroword already exists, in which case its macroword index WI is recovered; otherwise, a macroword is also created.

[0261] It is thus appreciated that, once words have been removed and/or added, the auxiliary tables are no longer sorted in the ascending order of the thesaurus words. The word index register has to be manipulated in order to maintain the thesaurus sorting.

[0262] However, it is not necessary to perform such manipulation of the word index register immediately. This is very advantageous because the updated database is made available for any new query without requiring a sorting operation in the whole thesaurus, which may take some time.

[0263] The newly created words or macrowords of a thesaurus can have their word indexes stored in a separate, secondary index register, whereas they share the same auxiliary table and coding data containers as the former words of the thesaurus. Only this secondary index register needs to be sorted when a thesaurus entry is added, which is a relatively light job since most of the thesaurus words belong to the primary register. When a word is deleted, its row in the primary or secondary index register remains with the default value in the word index column. Accordingly, to access the coding data relating to a given word range, the range boundaries are searched, by dichotomy, in both the primary and secondary index registers to determine the relevant word indexes which are then used in the usual way to address the common auxiliary table and data containers.
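
A range lookup over the primary and secondary registers might be sketched as follows in Python, using the standard bisect module for the dichotomic search. The layout of each register as two parallel sorted columns (words and word indexes), the constant DELETED standing for the default value −1, and the sample names are assumptions of this sketch.

```python
from bisect import bisect_left, bisect_right

DELETED = -1   # default value left in the word index column of a deleted word


def word_indexes_in_range(registers, low, high):
    """Collect the word indexes of all words W with low <= W <= high.

    registers: iterable of (words, indexes) pairs of parallel lists, each
    words list being sorted (e.g. the primary and secondary registers).
    """
    found = []
    for words, indexes in registers:
        lo = bisect_left(words, low)         # first word >= low
        hi = bisect_right(words, high)       # one past the last word <= high
        found.extend(wi for wi in indexes[lo:hi] if wi != DELETED)
    return found


if __name__ == "__main__":
    primary = (["Andre", "Ariane", "Max", "Oscar"], [0, 1, DELETED, 3])
    secondary = (["Laure", "Zoe"], [4, 5])   # hypothetical newly created words
    print(word_indexes_in_range([primary, secondary], "B", "P"))
```

The word indexes returned are then used to address the common auxiliary table, exactly as for a thesaurus with a single register.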

[0264] From time to time, when the CPU 101 is available, a batch task is run to merge the primary and secondary index registers while deleting their rows having the default value in the word index column. This is a straightforward external sorting operation since both registers are already sorted. The resulting merged register is saved to replace the primary register, and the secondary register is cancelled.
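
Since both registers are already sorted, the batch merge amounts to a single pass. A minimal Python sketch, assuming each register is an in-memory sorted list of (word, word index) pairs and using heapq.merge from the standard library, is given below; the function name merge_registers and the sample rows are hypothetical.

```python
from heapq import merge


def merge_registers(primary, secondary, deleted=-1):
    """Merge two sorted index registers, dropping the rows of deleted words.

    Each register is a sorted list of (word, word_index) pairs; rows whose
    word index equals the default value are not copied.
    """
    return [(w, wi) for w, wi in merge(primary, secondary) if wi != deleted]


if __name__ == "__main__":
    primary = [("Andre", 0), ("Max", -1), ("Oscar", 3)]
    secondary = [("Laure", 4)]
    print(merge_registers(primary, secondary))
```

The returned list would replace the primary register, the secondary register then being emptied.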

[0265] If the secondary word index register becomes too big (i.e. its sorting takes too long every time an entry is added) before such merge operation is carried out, it is possible to create a further, tertiary index register to receive the new thesaurus entries, and so forth.

[0266] 6/ Data Container Optimization

[0267] This is useful if the thesaurus organization is of the type shown in FIGS. 25-32 rather than of the type shown in FIG. 17.

[0268] As records are inserted and deleted in a coding data container, the above-mentioned condition that the record chains should preferably be arranged so that records pertaining to the same thesaurus word have contiguous addresses is no longer fulfilled. This does not prevent the database system from operating satisfactorily. However, in order to optimize the query processing time, it is preferable to rearrange the records of the coding data container and the corresponding columns of the thesaurus auxiliary table(s) as in the above-described step 125. Like the word index register sorting, such rearrangement can be carried out when CPU time is available.

Alternative Thesaurus Arrangements

[0269] If the thesauruses are arranged according to the preferred organization illustrated by FIG. 17, with distinct files for each word or macroword, the flow charts of FIGS. 19-24 are somewhat simplified. First, stage 125 of FIG. 19 is not performed (it is an advantage of the file organization to dispense with such sorting when the VDG's are created and maintained). In FIG. 20, the dichotomy search 135 and the thesaurus update of step 136 may be replaced by the procedure illustrated in FIG. 33.

[0270] In this procedure, imax(AT, W, k) designates the current number of layer k records in the coding data file relating to thesaurus AT and word W. These parameters are set to zero for all values of AT, W and k at the initialization step 130.

[0271] The value in the current data graph of the attribute AT selected in step 134 of FIG. 20 is allocated to the variable W in step 175 of FIG. 33, and the coding layer index k is initialized to 1. The integer i, which points to the records of the coding data file, is first set to zero in step 176. If i=imax(AT, W, k) in the following test 177, a record AT_W_NOk(i) having the value Qk is appended to the layer k rank file pertaining to word W and a record AT_W_HPk(i) having the all-zero value is appended to the corresponding bitmap segment file. This is done in step 178, where imax(AT, W, k) is also incremented by one unit. If i<imax(AT, W, k) in test 177, the rank AT_W_NOk(i) is loaded into the rank variable q in step 179. If the following test 180 shows that q is different from the quotient variable Qk, the integer i is incremented by one unit in step 181 and the process comes back to step 177 for examining the next rank variable of the file, if any. Accordingly, the scanning of the coding data record chain for each layer k (corresponding to loop 153-156 in FIG. 21) is performed within the AT_W_NOk file, which is smaller than the data container common to all words of the thesaurus. This keeps the number of disc accesses to a minimum.

[0272] After step 178, or when q=Qk in test 180, a “1” is written into the bit of rank Rk of the bitmap segment AT_W_HPk(i) in the relevant coding data file (step 182). The coding layer index k is compared with n (or with a lower value if the higher layer coding data are calculated afterwards) in test 183. If k<n, the index k is incremented by one unit in step 184 before coming back to step 176. When k=n, the thesaurus update is over and the program proceeds to step 137 of FIG. 20.
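
The procedure of FIG. 33 can be sketched compactly when the per-word files are modelled as in-memory lists. In the Python illustration below, the dictionaries NO and HP keyed by (AT, W, k) stand for the rank and bitmap segment files AT_W_NOk and AT_W_HPk, the length of a list plays the role of imax(AT, W, k), and bit Rk of an integer stands for the bit of rank Rk; these choices and the function name update_word are assumptions of the sketch.

```python
from collections import defaultdict


def update_word(NO, HP, AT, W, row_id, D):
    """Insert row_id into the lists of word W of thesaurus AT, for all layers.

    NO, HP: dictionaries mapping (AT, W, k) to the rank file and the bitmap
    segment file (lists). D: list of the divisors D1..Dn.
    """
    x = row_id
    for k, Dk in enumerate(D, start=1):          # step 175, test 183, step 184
        Qk, Rk = divmod(x, Dk)
        ranks, segments = NO[AT, W, k], HP[AT, W, k]
        i = 0                                    # step 176
        while i < len(ranks) and ranks[i] != Qk: # tests 177 and 180
            i += 1                               # step 181
        if i == len(ranks):                      # step 178: append a new record
            ranks.append(Qk)
            segments.append(0)
        segments[i] |= 1 << Rk                   # step 182
        x = Qk                                   # the rank is encoded in layer k+1


if __name__ == "__main__":
    NO, HP = defaultdict(list), defaultdict(list)
    for row in (1, 2, 9, 5):                     # hypothetical flat file row-ID's
        update_word(NO, HP, "CN", "Max", row, [4, 3])
    print(dict(NO), dict(HP))
```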

[0273] In the procedure of FIG. 33, the rank data AT_W_NOk(i), each consisting of an integer value, can be read in large blocks from the hard drive 105 to the cache memory, so that the procedure is very quick.

[0274] Another option which can be used in the thesauruses is to include in each entry relating to a word an indication of the representation format of the flat file row-ID list. Indeed, the format (e.g. low or normal density) can be chosen word by word depending on the number of data graphs including the word under consideration. This is illustrated in broken lines in the right part of FIGS. 25-27 in the case where there are only two formats, i.e. low density (0) and normal density with n=2 coding layers (1). In the example, all the thesaurus entries are in the normal density format. There could be more than two formats; for example, the format data in the thesaurus could specify the number of coding layers for each word. When the flat file row-ID lists are represented by data stored in data containers common to one or more thesauruses, distinct containers are provided for the different coding formats.

[0275] When the above option is used, the format for each thesaurus entry can be modified as the database lives, in order to optimize the storage. This is a low priority task since the query engine can work with any format. For example, when thesaurus entries are being updated, it is possible to mark any entry for which a format change appears to be desirable, based on predefined conditions fulfilled by the density of the word in the amended database. For example, a word or macroword could be changed from low to normal density format when a certain number of data graphs are identified in its thesaurus entry, and vice versa. Afterwards, when processor time is available, the marked entries can be translated into the new format to optimize the system.

[0276] It has been mentioned above that, when n>1, storing the rank data in every coding layer is somewhat redundant, since the flat file row-ID lists are completely defined by the bitmap segment data in all layers and the rank data in the last layer.

[0277] FIGS. 34A-B illustrate an alternative way of arranging the coding data files, which avoids storing the layer k ranks with k<n. In this arrangement, it is sufficient that the auxiliary tables (FIGS. 28-30) point to a first record in the layer n data container: the addresses AT_F1 and AT_L1 are not necessary. The data container of the highest layer n=2, shown in FIG. 34A, is the same as that of FIG. 32, with an additional field in each record to contain the head address F(n−1)(AD)=F1(AD) of a record chain in the data container of the lower layer n−1=1. The latter data container (FIG. 34B) has one record chain for each layer n rank pertaining to each thesaurus entry covered by the data container. Each record of a layer k<n data container comprises a first field for containing the address NXk(AD) of the next record of the chain (this address is 0 if there is no further address), and a second field for containing the corresponding bitmap segment HPk(AD). The layer k<n chain is ordered in accordance with the non-zero bits of the bitmap segment HP(k+1) stored in the record of the upper layer data container which contains the head address of the chain. If 1<k<n (not shown), the record further has a third field for containing the head address of a record chain in the data container of the lower layer k−1 (and so forth until k=1).

[0278] The procedure for retrieving a flat file row-ID list from a thesaurus pointing to data containers of FIGS. 34A-B may be as follows. The word index WI is used to obtain the address of the first relevant record in the layer 2 data container. For this address (and then for each address of the chain defined by the NX2 field), the layer 2 rank NO2 is read and the bitmap segment HP2 is scanned. Every time a “1” is found in this scanning, at a bit position R2, a layer 1 rank NO1=NO2×D2+R2 is determined and a corresponding record of the lower layer data container is read (the first time at the head address given by the column F1 in the layer 2 data container, and then at the addresses pointed to by the NX1 addresses in the layer 1 data container). By this method the layer 1 bitmap segments HP1 and their positions NO1 are retrieved to assemble the bitmap vector representing the desired flat file row-ID list.

[0279] In the general case, the data containers are accessed from layer n. Each segment HPk read after determining a rank NOk with k>1 is scanned to locate its non-zero bits. Each non-zero bit of HPk located in a position Rk provides a lower layer rank NO(k−1)=NOk×Dk+Rk, and a corresponding bitmap segment HP(k−1) is read in the chain designated in the lower layer container. The process is repeated recursively until k=1: the numbers NO1×D1+R1 are the flat file row-ID's for the thesaurus entry.
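The recursive retrieval just described can be illustrated by the sketch below. It is only an assumed model: each record is a pair (segment, children), where the Python list children replaces the linked record chain of the lower layer container and is ordered like the non-zero bits of the segment. The names set_bits, expand and decode are hypothetical.

```python
def set_bits(segment, width):
    """Positions Rk of the non-zero bits of a bitmap segment of `width` bits."""
    return [r for r in range(width) if (segment >> r) & 1]


def expand(rank, record, k, divisors):
    """Yield the flat file row-IDs found below a layer k record of rank NOk.

    record = (HPk, children); children[j] is the layer k-1 record matching
    the j-th non-zero bit of HPk (a list stands in for the record chain).
    """
    segment, children = record
    d = divisors[k - 1]                        # Dk
    for pos, r in enumerate(set_bits(segment, d)):
        lower_rank = rank * d + r              # NO(k-1) = NOk x Dk + Rk
        if k == 1:
            yield lower_rank                   # NO1 x D1 + R1 is a row-ID
        else:
            yield from expand(lower_rank, children[pos], k - 1, divisors)


def decode(top, divisors):
    """top: list of (NOn, record) pairs for one thesaurus entry."""
    n = len(divisors)
    rows = []
    for rank, record in top:
        rows.extend(expand(rank, record, n, divisors))
    return rows


top = [(0, (0b10, [(0b110, [])])), (1, (0b10, [(0b10, [])]))]
row_ids = decode(top, [4, 4])   # [5, 6, 21]
```

Only the layer n ranks are stored in this model; every lower layer rank is recomputed from its parent rank and the bit position, as in the arrangement of FIGS. 34A-B.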

[0280] The coding data files illustrated in FIGS. 34A-B can be created by a method similar to that described with reference to FIGS. 19-21. All the HPk and F(k−1) fields are initialized with zeroes before stage 124. The procedure of FIG. 21 is executed only for k=n, with step 161 replaced by the loop depicted in FIG. 35 in which the coding layer index k decreases from n to 1.

[0281] The first step 450 of this loop consists in writing the digit “1” at bit position Rk of the bitmap segment HPk(AD). If the coding layer k is greater than 1 (test 451), it is decremented by one unit in step 452, and the first address M=Fk(AD) is read in the layer (k+1) coding data container (step 453).

[0282] If M is zero (test 454), the head address Hk of the free record chain in the layer k coding data container is written into the first address field Fk(AD) of the layer (k+1) coding data container (step 455), to create a new chain. The value of AD is then replaced by Hk (step 456), and the record chains are updated in the layer k coding data container (steps 457-458): Hk is replaced by NXk(AD) before NXk(AD) is set to 0. After step 458, the process loops back to step 450.

[0283] If M>0 in test 454, the index R is set to 0 in step 460 to initialize the scanning of the bitmap segment HP=HP(k+1)(AD). If R is smaller than the remainder R(k+1) corresponding to the current data graph identifier (test 461), the corresponding bit HP(R) of the bitmap segment HP is evaluated (test 462). If HP(R)=0, the program proceeds to step 463 for incrementing R by one unit before coming back to test 461. When HP(R)=1 in test 462, it is necessary to move forward in the layer k record chain: the integer M′ receives the value of M in step 464, and M is replaced by NXk(M′) in step 465. If the new value of M is not zero (test 466), the program proceeds to the above-mentioned step 463. Otherwise, the end of the layer k record chain is reached, so that the head address Hk of the layer k free record chain is assigned to NXk(M′) in step 467 before proceeding to the above-mentioned step 456.

[0284] If R is equal to the remainder R(k+1) in test 461, the corresponding bit HP(R) of the bitmap segment HP is also evaluated (test 470). If HP(R)=1, the rank Qk already exists in the layer k+1 input list relating to the current thesaurus entry, so that it is not necessary to create a new record in the layer k coding data container: the value of AD is simply replaced by M in step 471, and the process loops back to step 450.

[0285] If HP(R)=0 in test 470, the value of AD is replaced by the head address Hk of the free record chain (step 472), and the Huffman-type record chains are updated in the layer k coding data container (steps 473-474): Hk is replaced by NXk(AD) before NXk(AD) is set to M. After step 474, the process loops back to step 450.

[0286] The loop of FIG. 35 is over when k=1 in test 451.
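The effect of the loop of FIG. 35 can be approximated by the sketch below. It deliberately swaps the linked record chains and free record chains of the data containers for ordinary Python lists, so it shows only the ordering rule (a lower layer record sits at the position given by the number of non-zero bits below R in the upper layer segment), not the address bookkeeping of steps 455-474. The name insert_row_id and the [segment, children] record layout are assumptions.

```python
def insert_row_id(top, row_id, divisors):
    """Insert one row-ID into a layered structure without lower layer ranks.

    top is a list of (NOn, node) pairs; node = [HPk, children], with
    children kept in the order of the non-zero bits of HPk.
    (A list replaces the linked record chain of the patent.)
    """
    remainders = []
    value = row_id
    for d in divisors:
        value, r = divmod(value, d)
        remainders.append(r)                   # R1, ..., Rn; value ends as NOn
    node = next((rec for rank, rec in top if rank == value), None)
    if node is None:
        node = [0, []]
        top.append((value, node))
    for k in range(len(divisors), 0, -1):      # k decreases from n to 1
        r = remainders[k - 1]
        segment, children = node
        already_set = (segment >> r) & 1
        node[0] = segment | (1 << r)           # step 450: write "1" at position Rk
        if k == 1:
            break                              # test 451: loop over at layer 1
        # Chain position = number of non-zero bits of HPk below position Rk.
        pos = bin(segment & ((1 << r) - 1)).count("1")
        if not already_set:
            children.insert(pos, [0, []])      # new lower layer record
        node = children[pos]


top = []
for row_id in (21, 6, 5, 2):
    insert_row_id(top, row_id, [4, 4])
```

Counting the non-zero bits below R plays the role of the scan of tests 461-466, which walks the chain once per non-zero bit encountered before position R(k+1).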

[0287] FIGS. 34C and 34D show tables whose contents are equivalent to those of FIGS. 34A and 34B, and in which the bitmap segments HPk for k>1 are not explicitly stored. The layer n coding data container (FIG. 34C) is identical to that described with reference to FIG. 34A, but without the HPn column. Each layer k coding data container for k<n (FIG. 34D) is identical to that described with reference to FIG. 34B, with an additional column R(k+1) containing layer k+1 remainders. The presence of a remainder value R(k+1) in a record of the layer k coding data container means that there is a “1” at position R(k+1) in the non-stored higher layer bitmap segment HP(k+1).

[0288] It will be appreciated that the scheme of FIG. 17, i.e. distinct coding data files for each thesaurus entry to minimize the disc accesses, is also applicable when the stored coding data do not include the ranks for layers 1, . . . , n−1. The layer n ranks and bitmap segments may be stored as in FIG. 17. For the lower layers, there are several options. There can be one data container for each thesaurus word and each coding layer k<n, with record chains pointed to in the records relating to the upper layer k+1. The layer k record chains can also be isolated in distinct files whose names include the attribute name AT, the word or macroword value W, the coding layer index k and a layer k+1 rank NO(k+1). Each record of such a file AT_W_k_NO(k+1) then contains a layer k+1 remainder R(k+1) and a layer k bitmap segment HPk which is located at rank NOk=NO(k+1)×D(k+1)+R(k+1).

Query Criteria Handling

[0289] As in any RDBMS, queries can be expressed in accordance with the Structured Query Language (SQL), which has been adopted as a standard by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI).

[0290] A general flow chart of the query processing procedure is shown in FIG. 36.

[0291] The query criteria, contained in the SQL “WHERE” clause, are converted into a request tree in stage 190 of FIG. 36. The query criteria are analyzed and structured according to a tree in which the leaves correspond to ranges for respective attribute values as defined in the SQL query and the nodes correspond to logical operations to be performed from those leaves. The leaves are also referred to as “BETWEEN clauses” of the SQL query. An individual attribute value defined in the SQL query is a BETWEEN clause covering a single word.

EXAMPLE 3

[0292] An example of such a tree is shown in FIG. 37 in the illustrative case of a query which consists in finding all data graphs relating to accidents undergone by client André or client Max and having a damage amount AA such that 500≦AA≦5000. That tree has three leaves, indicated by broken lines, corresponding to the BETWEEN clauses defined in the query: [André, André] and [Max, Max] for the client name attribute and [500, 5000] for the accident amount attribute. The tree also has two nodes, one for the OR operation between the two CN criteria, and one at the root for the AND operation with the AA criterion.

[0293] The tree decomposition is not unique. The one having the minimum number of nodes is preferably selected.
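The request tree of Example 3 can be written down as a small data structure, as sketched below. The class names Between and Node, the function matches and the sample records are hypothetical; the sketch evaluates the tree directly on one record, which is not how the query engine works (it combines coded row-ID lists) but makes the tree semantics explicit.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Between:
    """Leaf: a BETWEEN clause [low, high] on one attribute."""
    attribute: str
    low: object
    high: object


@dataclass
class Node:
    """Operator node with two children; op is "AND" or "OR"."""
    op: str
    left: "Tree"
    right: "Tree"


Tree = Union[Between, Node]


def matches(tree, record):
    """Evaluate the request tree on a single record (dict attribute -> value)."""
    if isinstance(tree, Between):
        return tree.low <= record[tree.attribute] <= tree.high
    left = matches(tree.left, record)
    right = matches(tree.right, record)
    return (left and right) if tree.op == "AND" else (left or right)


# Three leaves and two nodes, as in FIG. 37.
query = Node(
    "AND",
    Node("OR", Between("CN", "André", "André"), Between("CN", "Max", "Max")),
    Between("AA", 500, 5000),
)
```

A single attribute value such as CN=Max is represented as a BETWEEN clause whose two bounds coincide, in line with paragraph [0291].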

[0294] The next stage 191 is a tree expansion made by analyzing and splitting the BETWEEN clauses relating to attributes having macroword thesauruses. This is done from the tree obtained in step 190, with reference to the sorted thesaurus word and macroword index files associated with the attributes used in the query. The lower and upper bounds of each range defined in a BETWEEN clause are compared with the words of the thesaurus associated with the attribute, to find a decomposition of the range into sub-ranges, whereby each sub-range is also defined as a BETWEEN clause in a word or macroword thesaurus.

[0295] In a preferred embodiment, the decomposition is optimized to make maximum use of the macrowords. This optimization consists in retaining the lowest possible number of words or macrowords to form the sub-ranges to be mapped onto the range defined in the BETWEEN clause. The system selects the highest level macrowords that are included in the interval, and repeats the same process in the remaining parts of the range until the atomic word level is attained or the words of the range are exhausted.

[0296] In the expanded tree produced in stage 191, the BETWEEN leaves which have been split are replaced by sub-trees made of OR nodes and leaves associated with the sub-ranges. Those leaves are also in the form of BETWEEN clauses, covering thesaurus entries relevant to the query. The expanded tree defines a combination of the relevant thesaurus entries for the subsequent processing.

[0297] All the leaves of the expanded tree are associated with respective word or macroword (sub-)ranges. Such a range may be defined by its bounds in terms of word or macroword row-ID's in the thesaurus index file.

[0298] FIG. 38 shows the expanded tree corresponding to the tree of Example 3 (FIG. 37). It is obtained by means of the thesaurus index files of FIGS. 25-27. The one-word ranges “CN=André” and “CN=Max” are not split, but simply encoded by the row-ID's CN_x=1 and 4 of the words in the thesaurus index file, obtained by dichotomic searches. Another search in the accident amount thesauruses of FIGS. 26 and 27 leads to splitting the range 500≦AA≦5000 into three sub-ranges, one for each of the individual words AA_x=2 and AA_x=5, and one for the macroword AA_3_x=1.

[0299] FIG. 39 shows a flow chart of an optimal procedure for splitting a BETWEEN clause in stage 191 of FIG. 36. It is assumed that the (connected) range does not include the Null value (otherwise the leaf can be first split into two substitute leaves linked by an OR node, one leaf with the individual word row-ID AT_x=0, and the other satisfying the above assumption).

[0300] It is also assumed that the attribute AT considered in the BETWEEN clause has a number Q≧0 of macroword thesauruses indexed by an integer level parameter q with 1≦q≦Q, the level q=0 designating the individual word thesaurus. For a level q thesaurus, the prefix length (e.g. number of ASCII characters) is noted P(q), with P(0)>P(1)> . . . >P(Q). P(0) is the individual word length. In FIGS. 39-40, x_(max) designates the number of non-Null words in thesaurus 0, W_(q)(x) designates the (macro)word stored at row-ID=x in the level q thesaurus, and [W]_(P(q)) designates the macroword obtained by truncating a word W to keep its prefix of length P(q), for q≧1.

[0301] In the initial step 200 of the procedure of FIG. 39, the program selects the word thesaurus row-ID's a and b such that W₀(a) and W₀(b) are respectively the lowest and highest thesaurus words included in the range defined for the leaf being processed. The integers a and b are readily selected by dichotomic searches in the word thesaurus based on the range bounds. If the search shows that the range covers no thesaurus word, the procedure is terminated by specifying that the leaf output will be an empty flat file row-ID list.

[0302] If W₀(a) is the lowest word of the thesaurus (a=1 in test 201), the binary variable XL is initialized as XL=0 in step 202. Otherwise, it is initialized as XL=1 in step 203. If W₀(b) is the highest word of the thesaurus (b=x_(max) in test 204), the binary variable XR is initialized as XR=0 in step 205. Otherwise, it is initialized as XR=1 in step 206. In the following steps, the value XL(XR)=0 denotes the fact that the lower (upper) bound of the range under consideration is aligned with a macroword boundary. If it is aligned with a macroword boundary from a level q thesaurus, then this is also true for any level q′ thesaurus with 1≦q′≦q. The initialization 201-206 is valid for q=Q.

[0303] In step 207, the program invokes a function FUNC whose flow chart is represented in FIG. 40. This function returns data describing a sub-tree to be inserted in the place of the processed leaf (step 208). The function FUNC has six arguments input when starting its execution in step 210 of FIG. 40A: the attribute reference AT; a thesaurus level parameter q (q=Q when the function is first invoked in step 207 of FIG. 39); the thesaurus row-ID's a and b of the lowest and highest AT words in the range of interest; and the above-defined variables XL and XR.

[0304] After step 210, it is determined whether the thesaurus level parameter q is zero (test 211). If q>0, two macroword thesaurus row-ID's a′ and b′ are selected in step 212, such that W_(q)(a′)=[W₀(a)]_(P(q)) and W_(q)(b′)=[W₀(b)]_(P(q)). This is done by simple dichotomic searches in the level q thesaurus after truncating the words W₀(a) and W₀(b).

[0305] In the following test 213, the variable XL is evaluated. If XL=1, it is determined in test 214 whether the consecutive words W₀(a−1) and W₀(a) share the same level q macroword, i.e. whether [W₀(a−1)]_(P(q))=W_(q)(a′). If so, the integer a′ is increased by one unit in step 215. If [W₀(a−1)]_(P(q))<W_(q)(a′) in test 214, the value of XL is changed to 0 in step 216 since the lower bound of the range under consideration is aligned with a level q macroword boundary. After step 215 or 216, or when XL=0 in test 213, the variable XR is evaluated (test 217). If XR=1, it is determined in test 218 whether the consecutive words W₀(b) and W₀(b+1) share the same level q macroword, i.e. whether [W₀(b+1)]_(P(q))=W_(q)(b′). If so, the integer b′ is decreased by one unit in step 219. If [W₀(b+1)]_(P(q))>W_(q)(b′) in test 218, the value of XR is changed to 0 in step 220 since the upper bound of the range under consideration is aligned with a level q macroword boundary.

[0306] After step 219 or 220, or when XR=0 in test 217, the variables a′ and b′ are compared in test 221. If a′>b′, no level q macroword is spanned by the range under consideration, so the program decrements q by one unit in step 222 and comes back to step 211.

[0307] When a′≦b′ in test 221, a sub-range of b′−a′+1 macrowords is generated for insertion into the expanded query tree (step 223 in FIG. 40B). This sub-range covers the macroword row-ID's from AT_P(q)_x=a′ to AT_P(q)_x=b′.

[0308] Afterwards, the variable XL is evaluated again in step 224. If XL=1, another range has to be considered, below the sub-range generated in step 223. In step 225, the row-ID b″ of the upper bound of that lower range is determined: the corresponding word W₀(b″) is the highest of the AT thesaurus such that [W₀(b″)]_(P(q))<W_(q)(a′). The function FUNC(AT, q−1, a, b″, 1, 0) is then called recursively in step 226, to deal with the additional lower range. After step 226, or when XL=0 in test 224, the variable XR is evaluated again in step 227. If XR=1, another range has to be considered, above the sub-range generated in step 223. In step 228, the row-ID a″ of the lower bound of that upper range is determined: the corresponding word W₀(a″) is the lowest of the AT thesaurus such that [W₀(a″)]_(P(q))>W_(q)(b′). The function FUNC(AT, q−1, a″, b, 0, 1) is then called recursively in step 229, to deal with the additional upper range.

[0309] When q=0 in test 211, a sub-range of b−a+1 words is generated for insertion into the expanded query tree (step 230). This sub-range covers the individual word row-ID's from AT_x=a to AT_x=b.

[0310] After step 229 or 230, or when XR=0 in test 227, the execution of the function FUNC is terminated in step 231 by returning the data describing the sub-tree, which have been generated in step 223 or 230 and/or which have been returned by the function recursively called in steps 226 and/or 229.
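The idea behind the splitting of a BETWEEN range into macrowords can be illustrated by the simplified sketch below. It is not the flow chart of FIGS. 39-40: instead of the XL/XR flags and dichotomic searches, it scans the sorted word list directly and keeps a macroword only when all the words sharing its prefix lie inside the range. The names group_extent and split_range, the 0-based row-ID's and the sample amounts are assumptions made for illustration.

```python
def group_extent(words, i, p):
    """First and last index of the words sharing the length-p prefix of words[i]."""
    prefix = words[i][:p]
    start = i
    while start > 0 and words[start - 1][:p] == prefix:
        start -= 1
    end = i
    while end + 1 < len(words) and words[end + 1][:p] == prefix:
        end += 1
    return start, end


def split_range(words, prefix_lengths, a, b, q):
    """Cover words[a..b] with (level, word-or-macroword) items.

    words is the sorted level 0 thesaurus (fixed-length strings) and
    prefix_lengths[q] stands in for P(q), with P(0) > P(1) > ... > P(Q).
    """
    if q == 0:
        return [(0, w) for w in words[a:b + 1]]
    p = prefix_lengths[q]
    items = []
    i = a
    while i <= b:
        start, end = group_extent(words, i, p)
        if start >= a and end <= b:
            items.append((q, words[i][:p]))          # whole macroword inside the range
        else:
            # Macroword only partly covered: retry this part at level q-1.
            items.extend(split_range(words, prefix_lengths, i, min(end, b), q - 1))
        i = min(end, b) + 1
    return items


amounts = ["0100", "0450", "0500", "0520", "0580", "1200", "4900", "5000", "7000"]
# Range 0500..5000 with P(0)=4, P(1)=2, P(2)=1 (hypothetical thesauruses).
sub_ranges = split_range(amounts, [4, 2, 1], 2, 7, 2)
```

In this hypothetical thesaurus the range is covered by the level 1 macroword "05" and the level 2 macrowords "1", "4" and "5", i.e. four items instead of six individual words.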

[0311] Once the stage 191 of analyzing and expanding the query tree is completed, the expanded tree is processed in stage 192 of FIG. 36, starting from the highest coding layer n. If n>1, the processing is performed successively in the layers k, with k decreasing from n to 1, as shown in the loop represented in FIG. 41.

[0312] The coding layer index k is initialized with the value n in step 240 of FIG. 41. The layer k processing is started in step 241 by selecting the root ND of the expanded query tree as a node for calling a function named FNODE (step 242). The inputs to this function comprise the coding layer index k, the parameters describing node ND and its child nodes, and a bitmap vector Res (initialized in an arbitrary manner for k=n). Its output is a bitmap vector noted WZ. In layer 1, the bits of value 1 of the output bitmap vector WZ indicate the VDG's (flat file row-ID's) matching the query criteria defined by the tree whose root is node ND. In layer k>1, they indicate the respective layer k−1 ranks of the groups of Δk flat file row-ID's which include at least one matching flat file row-ID. In each coding layer index k, the function FNODE is called recursively to process all the nodes of the expanded query tree.

[0313] The bitmap vector WZ output by the function called in step 242 is saved as the layer k query result Res in step 243, to be used in the subsequent layer k−1 processing if k>1. If so (test 244), the index k is decremented by one unit in step 245, and the next layer processing is started from step 241.

[0314] For k=1, Res is the bitmap representation of the desired flat file row-ID list, output in step 246.

[0315] A flow chart of function FNODE is shown in FIG. 42. The bitmap vector WZ is considered there as a succession of segments of Dk bits. The segment of rank N of vector WZ (i.e. the (N+1)-th segment with N≧0) is noted WZ[N]. The bit of rank N of vector WZ (i.e. the (N+1)-th bit with N≧0) is noted WZ(N). After the function is started (step 248), a working zone is reserved in RAM 103 for containing the bitmap vector WZ (step 249).

[0316] In test 250, it is first determined whether ND designates a preset node. A preset node (not illustrated in the example of FIG. 38) is a node for which a flat file row-ID list has already been determined. Typically, that list has been produced as a matching data graph identifier list in the processing of a previous query (output of step 192). It may also be a combination of such matching identifier lists. One or more preset nodes can be defined in the conversion step 190 when the SQL query refers to the results of one or more previous queries, for example to restrict the response to records which were included in the response to the previous queries. This feature is particularly useful when the database is used in interactive mode.

[0317] The flat file row-ID list previously determined for a preset node can be stored in RAM 103 or saved in hard drive 105 (preferably in compressed form in the latter case). That list is encoded according to the n coding layers to provide layer k input lists in the form of bitmap vectors for 1≦k≦n. Such a layer k bitmap vector is loaded as WZ in step 251 when test 250 reveals that the node ND is preset.

[0318] Otherwise, if ND does not designate a leaf but an operator node (test 252), its first child node ND1 is selected in step 253, and the function FNODE is called recursively in step 254 to obtain the bitmap vector WZ1 corresponding to node ND1. The second child node ND2 of the operator node ND is then selected in step 255, and the function FNODE is called again in step 256 to obtain the bitmap vector WZ2 corresponding to node ND2.

[0319] In step 257, the bitmap vectors WZ1 and WZ2 are combined bitwise to form the bitmap vector WZ. The combination (WZ(N)=WZ1(N)⊗WZ2(N) for any N) is in accordance with the Boolean operator ⊗ described in the parameters of node ND, e.g. the AND, OR or Exclusive OR operation. It is essentially a superposition of bitmap vectors, which is performed very quickly since both operand vectors are stored in RAM 103. In step 258, the RAM space which has been allocated to working zones WZ1 and WZ2 is released. In FIG. 42, only the case where the operator node has two child nodes is considered. Clearly it can be extended to the case where there are more than two operands. Moreover, some operations may involve a single operand, such as the NOT operation, so that the function FNODE may be called only once.

[0320] When node ND is a leaf (test 252), all the bits of the working zone WZ are set to zero in the initialization step 260. In addition, the thesaurus pointer x is initialized to the value x1 of the first row-ID of the BETWEEN range defined for node ND.

[0321] If node ND relates to an attribute AT and macroword index q for which the thesaurus is stored in the “low density” format (test 261), the leaf processing is as described below with reference to FIG. 43 (step 262) to obtain the relevant bitmap vector WZ. If the thesaurus format is “normal density”, the processing depends on whether the program is in the (chronologically) first layer, that is k=n (test 263). The processing of FIG. 44 is applied if k=n (step 264), and that of FIG. 45 if k<n (step 265).

[0322] After step 251, 258, 262, 264 or 265, the execution of function FNODE is terminated in step 266 by returning the bitmap vector WZ.
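The recursive structure of FNODE for operator nodes can be sketched as follows. Bitmap vectors are modelled by Python integers (bit N of the integer is WZ(N)), leaves and preset nodes simply carry a ready-made bitmap in place of the thesaurus processing of FIGS. 43-45, and the tuple layout of the nodes is an assumption of this sketch.

```python
import operator

# Boolean operators that may be described in the parameters of an operator node.
OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def fnode(node):
    """Return the bitmap vector WZ (as an int) for a node of the expanded tree.

    node is ("leaf", bitmap) or ("op", name, [child, ...]); the leaf bitmap
    stands in for the result of the leaf or preset node processing.
    """
    if node[0] == "leaf":
        return node[1]
    _, name, children = node
    wz = fnode(children[0])                    # WZ1 (steps 253-254)
    for child in children[1:]:                 # WZ2 and any further operand
        wz = OPS[name](wz, fnode(child))       # step 257: bitwise combination
    return wz


andre = ("leaf", 0b00101)     # hypothetical row-ID bitmaps
max_ = ("leaf", 0b01000)
amount = ("leaf", 0b01100)
tree = ("op", "AND", [("op", "OR", [andre, max_]), amount])
result = fnode(tree)          # 0b01100
```

Because whole vectors are combined with one machine-level Boolean operation per word, the cost of step 257 does not depend on how many row-ID's match.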

[0323] For explaining the low density processing, we assume in FIG. 43 that the thesaurus storage also makes use of record chains: the thesaurus has an index file similar to those of FIGS. 25-27 (the word index stored at row-ID x being noted AT_WI(x)) and an auxiliary table addressed by the word indexes and containing the addresses AT_F(WI) in a data container of the first flat file row-ID's of the record chains. In each record of address AD>0, this data container has, in addition to a flat file row-ID value NO(AD), a next address field for containing a pointer to the next address NX(AD) of the record chain. The chain tail has NX(AD)=0. Alternatively, the low density lists could be stored in individual files for each word (similarly to FIG. 17).

[0324] The low density processing of FIG. 43 has a loop in which the words of the BETWEEN range are successively handled. In each iteration, the program first obtains the word index WI=AT_WI(x) in step 270, and then the head address AD=AT_F(WI) in step 271 to initiate the scanning of the record chain. If AD>0 (test 272), there remains at least one item to be examined in the record chain, so that the flat file row-ID value NO(AD) and the next address NX(AD) are read as variables N and M, respectively, in step 273. The k−1 Euclidean division of N by

$\Delta k = \prod_{k'=1}^{k-1} Dk' \quad (\Delta 1 = 1)$

[0325] is made in step 274 to obtain the layer k−1 quotient (rank) N′. For k=1, N′=N. For k>1, this operation 274 is simply a deletion of the

$\sum_{k'=1}^{k-1} \delta k'$

[0326] least significant bits of N (remainder) if the layer k′ divisors Dk′ are 2^(δk′) with δk′ integer (1≦k′<k). A “1” is then written into bit WZ(N′) of the bitmap vector WZ (step 275). The next address M is substituted for AD in step 276 before coming back to the test 272. When the record chain has been completely examined (AD=0 in test 272), it is determined whether the current word x is the last one x2 of the BETWEEN range (test 277). If x<x2, the thesaurus pointer x is incremented by one unit in step 278 for the next iteration of the loop. The loop is over when x=x2 in test 277, and the program proceeds to step 266 of FIG. 42.
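The low density leaf processing reduces, for power-of-two divisors, to a right shift followed by a bit write, as the sketch below shows. The record chains are replaced by plain Python lists of row-ID's (one list per word of the BETWEEN range), the bitmap vector WZ is a Python integer, and the function name low_density_bitmap is hypothetical.

```python
def low_density_bitmap(row_id_lists, k, delta_bits):
    """Build the layer k bitmap vector WZ from explicit row-ID lists.

    row_id_lists: one list of flat file row-ID's per word of the BETWEEN range.
    delta_bits[k'-1] stands in for delta_k', i.e. Dk' = 2 ** delta_bits[k'-1].
    """
    shift = sum(delta_bits[:k - 1])       # number of least significant bits dropped
    wz = 0
    for row_ids in row_id_lists:          # loop over the words (tests 277-278)
        for n in row_ids:                 # scan of one record chain
            wz |= 1 << (n >> shift)       # steps 274-275: N' = N div (Delta k)
    return wz


lists = [[5, 6], [21]]                    # hypothetical lists for two words
layer1 = low_density_bitmap(lists, 1, [2, 2])   # bits 5, 6 and 21 set
layer2 = low_density_bitmap(lists, 2, [2, 2])   # bits 1 and 5 set
```

For k=1 the shift is zero, which matches N′=N in the text.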

[0327] The layer n normal density processing of FIG. 44 has a similar loop in which the words or macrowords of the BETWEEN range are successively handled, but without recalculating the (stored) coding data. In each iteration, the program first obtains the word index WI=AT_P(q)_WI(x) in step 280, and then the head address AD=AT_P(q)_Fn(WI) in step 281 to initiate the scanning of the record chain. If AD>0 (test 282), there remains at least one item to be examined in the record chain, so that the layer n rank value NOn(AD), the next address NXn(AD) and the corresponding layer n bitmap segment HPn(AD) are read as variables N, M and H, respectively, in step 283. The bitmap segment H is then superimposed, by a bitwise Boolean OR operation, onto the segment WZ[N] of bitmap vector WZ (step 284), and M is substituted for AD in step 285 before coming back to test 282. When the record chain has been completely examined (AD=0 in test 282), it is determined whether the current word x is the last one x2 of the BETWEEN range (test 286). If x<x2, the thesaurus pointer x is incremented by one unit in step 287 for the next iteration of the loop. The loop is over when x=x2 in test 286, and the program proceeds to step 266 of FIG. 42.

[0328] The layer k<n normal density processing is detailed in FIG. 45 in the case where the thesauruses are arranged as illustrated in FIGS. 25-32. It takes advantage of the fact that, even where NOk(AD) belongs to a layer k rank list associated with a word or macroword of the BETWEEN range, it is useless to access the bitmap segment HPk(AD) if there is a zero in the bit of rank NOk(AD) of the bitmap vector Res obtained in the preceding layer k+1.

[0329] The procedure of FIG. 45 is comparable to that of FIG. 44. Steps 280-282 and 285-287 are the same with k substituted for n. However, when a record chain is to be examined (AD>0 in test 282), only the layer k rank value NOk(AD) and the next address NXk(AD) are read as variables N and M in step 290. The bit Res(N) of the layer k+1 result bitmap Res is then evaluated in test 291. If Res(N)=0, the rank N is filtered out by jumping directly to step 285. Otherwise (Res(N)=1), the bitmap segment HPk(AD) is read in step 293 before proceeding to step 284.
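The filtering of FIG. 45 can be sketched as below. The record chain is modelled by a list of (AD, NOk) pairs, the segment file by a callable read_segment (a hypothetical stand-in for the disc read of step 293), and bitmap vectors by Python integers in which the segment of rank N occupies bits N×Dk to N×Dk+Dk−1.

```python
def layer_k_leaf(records, res, dk, read_segment):
    """Layer k<n leaf processing with filtering by the layer k+1 result Res.

    records: (address AD, layer k rank NOk) pairs of the record chains.
    res: layer k+1 result bitmap (int); bit N tells whether rank N is useful.
    read_segment(ad): returns HPk(AD); called only for useful ranks.
    """
    wz = 0
    for ad, rank in records:                     # step 290: read NOk(AD) only
        if (res >> rank) & 1 == 0:               # test 291: Res(N)=0
            continue                             # rank filtered out, no segment read
        wz |= read_segment(ad) << (rank * dk)    # steps 293 and 284: OR onto WZ[N]
    return wz


segment_file = {1: 0b0110, 2: 0b0010, 3: 0b1000}   # hypothetical HPk data
reads = []

def read_segment(ad):
    reads.append(ad)                             # record which segments are accessed
    return segment_file[ad]

res = (1 << 1) | (1 << 5)                        # ranks 1 and 5 survived layer k+1
wz = layer_k_leaf([(1, 1), (2, 5), (3, 7)], res, 4, read_segment)
```

Here the segment at address 3 is never read, because rank 7 is absent from Res.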

[0330] With the arrangement of the thesaurus entry coding data, it is noted that the loops of FIGS. 44 and 45 will generally imply the successive reading of contiguous data container records (steps 283 and 290), because each word of index WI has its coding data stored at consecutive addresses AD in the data container, as well as most consecutive words of the BETWEEN range. Therefore, those loops can be executed efficiently by loading blocks of data container records by means of the computer cache memory, thereby reducing the required number of disc accesses. The same consideration applies to the low density data NO(AD) and NX(AD) read in step 273 of FIG. 43.

[0331] A further improvement is obtained with the layer k<n normal density processing shown in FIG. 46, which is made of two successive loops. The first loop, indexed by the thesaurus pointer x, is for determining a temporary rank table noted TNO, which is used to handle the bitmap segments in the second loop. Table TNO has a number of addresses which is at least equal to the number of addresses ADmax of the data container in which the layer k coding data of the current thesaurus (AT, q) are stored. Each entry TNO(AD) of address AD in the rank table TNO is for containing an integer representing the rank NOk(AD) if it is useful to access the bitmap segment HPk(AD), or else a default value (−1).

[0332] In the initialization step 279, all entries of the rank table TNO are set to the default value −1. The first loop is comparable to that of FIG. 45. When Res(N)=1 in test 291, the rank N is written at address AD into table TNO in step 295 before substituting M for AD in step 285.

[0333] When the first loop is over (x=x2 in test 286), the program proceeds to the second loop which is initialized with AD=1 in step 301. In each iteration of the second loop, the contents N of the rank table TNO at address AD, read in step 302, are compared with the default value in test 303. If N is a valid rank value (≠−1), the bitmap segment HPk(AD) is read (step 304) and superimposed, by a bitwise Boolean OR operation, onto the segment WZ[N] of the bitmap vector WZ (step 305). If AD<ADmax (test 306), the rank table address AD is incremented by one unit in step 307 before coming back to step 302. The second loop is over when AD=ADmax in test 306, and the program proceeds to step 266 of FIG. 42.

[0334] In addition to filtering out the bitmap segments HPk(AD) that are not worth reading, the procedure illustrated by FIG. 46, owing to the rank table TNO, groups the read operations in the file containing the layer k bitmap segment data based on the address AD (step 304 in the second loop). Such grouping is not only done word by word but for all words of the BETWEEN range: when the HPk file is eventually read in the second loop, no more distinction is made between the words for which a rank value has been written into table TNO. This takes maximum advantage of the blockwise access to the HPk file, and provides a very significant advantage because the lower layers, especially layer 1, imply the largest HPk files and the highest numbers of read operations therein.
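The two-loop procedure of FIG. 46 can be sketched as follows, under the same modelling assumptions as before: Python integers for bitmap vectors, lists of (AD, NOk) pairs for the record chains of each word, and a list segments indexed by the address AD (index 0 unused) for the HPk file. The function name layer_k_leaf_grouped is hypothetical.

```python
def layer_k_leaf_grouped(chains, res, dk, segments):
    """Layer k<n leaf processing with a temporary rank table TNO.

    chains: one list of (AD, NOk) pairs per word of the BETWEEN range.
    segments[ad] stands in for HPk(AD); addresses run from 1 to ADmax.
    """
    tno = [-1] * len(segments)                   # step 279: default value -1
    # First loop: note the useful ranks, word by word, without reading HPk.
    for chain in chains:
        for ad, rank in chain:
            if (res >> rank) & 1:                # test 291
                tno[ad] = rank                   # step 295
    # Second loop: read the HPk file once, in increasing address order.
    wz = 0
    for ad in range(1, len(segments)):
        if tno[ad] != -1:                        # test 303
            wz |= segments[ad] << (tno[ad] * dk) # steps 304-305
    return wz


segments = [0, 0b0110, 0b0010, 0b1000]           # hypothetical HPk file, ADmax = 3
chains = [[(1, 1)], [(2, 5), (3, 7)]]            # record chains of two words
wz = layer_k_leaf_grouped(chains, (1 << 1) | (1 << 5), 4, segments)
```

The second loop visits the addresses sequentially regardless of which word they belong to, which is what allows the reads of the HPk file to be grouped into blocks.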

[0335] FIG. 47 shows how the procedure of FIG. 45 can be adapted when the coding data containers are stored as illustrated in FIGS. 25-30 and 34A-B. The loop has a similar structure. However, since the coding data are accessed from the highest layer n, the address AD read in step 281 is the head address AT_P(q)_Fn(WI) of the record chain in the layer n data container, and when AD>0 in step 282, the rank value NOn(AD) and next address NXn(AD) read as variables N and M in step 296 also relate to layer n. After step 296, a filtering function FILT is called in step 297 before substituting M for AD in step 285.

[0336] A flow chart of this function FILT is shown in FIG. 48. Its arguments, input when starting its execution in step 500, are as follows (in addition to the attribute name and macroword level which are implicit in FIGS. 47-48):

[0337] a first coding layer index k, corresponding to the first argument of the function FNODE called in step 242 of FIG. 41;

[0338] a second coding layer index k′>k, with k′=n when the function FILT is called in step 297 of FIG. 47;

[0339] k′−k bitmap vectors Res_(k+1), Res_(k+2), . . . , Res_(k′), where Res_(k+1) is the layer k+1 query result Res. If k′>k+1, Res_(k+2), . . . , Res_(k′) are the bitmap vectors obtained, in step 243 of FIG. 41, by encoding Res in the higher layers;

[0340] a layer k′ rank N, with N=NOn(AD) when the function FILT is called in step 297 of FIG. 47;

[0341] the corresponding record address AD in the layer k′ data container; and

[0342] the bitmap vector WZ which is being calculated.

[0343] In test 501, it is determined whether the (N+1)-th segment of the bitmap vector Res_(k′) is only made of zeroes. If so, it is not necessary to read any further coding data relating to the layer k′ rank N, so that the execution of the function is terminated in step 502 by returning the bitmap vector WZ.

[0344] If the segment Res_(k′)[N] has at least one “1” in test 501, the bitmap segment HPk′(AD) is read as segment variable H in step 503, and the intersection segment H AND Res_(k′)[N] is evaluated in test 504. If this intersection segment is only made of zeroes, it is also useless to read any further coding data, and the program directly proceeds to step 502.

[0345] If test 504 reveals that H AND Res_(k′)[N] has at least one “1”, it is necessary to get into the lower layer record chain. Its head address F(k′−1)(AD) is read as variable AD′ in step 505, while the layer k′ remainder R is initialized to 0 and the layer k′−1 rank N′ is initialized to N×Dk′. The bitmap segment H=HPk′(AD) is scanned in a loop in which its bits H(R) are successively examined (test 506) to ascertain whether the rank N′=N×Dk′+R should be considered. If H(R)=0, the rank N′ is not in the layer k′ coding data of the current thesaurus entry, so that it is disregarded: R is incremented by one unit in step 507 and if the new R is still smaller than Dk′ (test 508), N′ is also incremented by one unit in step 509 before proceeding to the next iteration from test 506.

[0346] If H(R)=1 in test 506, the (N′+1)-th bit of the vector Res_(k′) is examined in test 510 to determine whether the layer k′−1 rank N′ has been filtered out in the higher layer processing. If so (Res_(k′)(N′)=0), the program jumps to the next position in the layer k′−1 record chain by replacing AD′ by the next address NX(k′−1)(AD′) in step 511. After step 511, the program proceeds to the above-described step 507.

[0347] If Res_(k′)(N′)=1 in test 510, the processing depends on whether the coding layer k′ is immediately above k (test 512). If k′=k+1, the bitmap segment HPk(AD′) is read (step 513) and superimposed, by a bitwise Boolean OR operation, onto the segment WZ[N′] of the bitmap vector WZ (step 514). If k′>k+1 in test 512, the recursive function FILT is called in step 515 with the arguments k, k′−1, Res_(k+1), . . . , Res_(k′−1), N′, AD′ and WZ. After step 514 or 515, the program proceeds to the above-described step 511.

[0348] The scanning of the bitmap segment H=HPk′(AD) is over when R=Dk′ in test 508. The updated bitmap vector WZ is then returned in step 502.
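The recursion performed by the function FILT may be easier to follow from a short Python sketch, given here for illustration only. It assumes that each vector Res_(j) is held in an integer whose bit of rank N′ refers to the layer j−1 rank N′, that the columns of each layer container are lists addressed from 1, and that WZ is a dictionary mapping a layer k rank to its accumulated segment. The names `segment`, `filt`, `scan_entry`, `layers` and `D`, as well as the sample data, are hypothetical.

```python
def segment(bitmap, n, width):
    """Return the segment of rank n (width bits) of a bitmap held in an int."""
    return (bitmap >> (n * width)) & ((1 << width) - 1)


def filt(k, kp, res, n, ad, wz, layers, D):
    """Sketch of FILT: k is the target layer, kp the current layer k' > k.

    res[j]    -- bitmap Res_j (an int); its bit N' refers to layer j-1 rank N'
    layers[j] -- columns of the layer j container: "HP", "NX" and, for j > k,
                 "F" (head address of the layer j-1 record chain)
    D[j]      -- number of bits of a layer j segment
    wz        -- dict mapping a layer k rank to the segment accumulated so far
    """
    seg = segment(res[kp], n, D[kp])
    if seg == 0:                                  # test 501
        return wz
    h = layers[kp]["HP"][ad]                      # step 503
    if h & seg == 0:                              # test 504
        return wz
    ad_low = layers[kp]["F"][ad]                  # step 505
    for r in range(D[kp]):                        # loop 506-509
        if not (h >> r) & 1:
            continue                              # rank absent from this entry
        n_low = n * D[kp] + r
        if (res[kp] >> n_low) & 1:                # test 510
            if kp == k + 1:                       # test 512
                wz[n_low] = wz.get(n_low, 0) | layers[k]["HP"][ad_low]
            else:
                filt(k, kp - 1, res, n_low, ad_low, wz, layers, D)
        ad_low = layers[kp - 1]["NX"][ad_low]     # step 511
    return wz


def scan_entry(k, n_top, head, res, layers, D):
    """Outer loop of FIG. 47 for one thesaurus entry (hypothetical helper)."""
    wz, ad = {}, head
    while ad > 0:
        filt(k, n_top, res, layers[n_top]["NO"][ad], ad, wz, layers, D)
        ad = layers[n_top]["NX"][ad]
    return wz


# One entry holding the integers {5, 6, 22, 70}, coded in 3 layers of 4 bits.
D = {1: 4, 2: 4, 3: 4}
layers = {
    3: {"NO": [None, 0, 1], "HP": [0, 0b0011, 0b0001],
        "NX": [0, 2, 0], "F": [0, 1, 3]},
    2: {"HP": [0, 0b0010, 0b0010, 0b0010],
        "NX": [0, 2, 3, 0], "F": [0, 1, 2, 3]},
    1: {"HP": [0, 0b0110, 0b0100, 0b0100], "NX": [0, 2, 3, 0]},
}
# Query result {6, 70, 100} encoded in layers 2 and 3.
res = {2: (1 << 1) | (1 << 17) | (1 << 25), 3: (1 << 0) | (1 << 4) | (1 << 6)}
wz = scan_entry(1, 3, 1, res, layers, D)
```

In this example the layer 1 record of rank 5 (integer 22) is never read, because its layer 2 rank 1 is already absent from Res_(3); only the segments of ranks 1 and 17 are written into WZ.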

[0349] When the coding data containers are arranged as illustrated in FIGS. 34C-D, the scanning of the layer k′ bitmap segment in loop 505-509 is replaced by the scanning of the layer k′ remainders in the record chain of the layer k′−1 coding data container.

[0350] The procedure of FIGS. 47-48 has the advantage that the lower layer record chains are accessed only when it is strictly necessary. In particular, it is noted that the loop 282-285 of FIG. 45 requires the reading of all the layer k ranks (step 290) relating to the current thesaurus entry while it may be already known from the layer k+1 processing that some ranks will be disregarded (Res(N)=0 in test 291). When this occurs in FIGS. 47-48, the rank N is not read from the hard drive (it is not even stored). This advantage is very significant since the lower layers, particularly layer 1, have the largest coding data containers, so that plenty of useless read operations are avoided.

[0351] It is noted that the use of a rank table TNO according to FIG. 46 is quite compatible with the procedure of FIGS. 47-48. The first loop 280-287 of FIG. 46 is simply replaced by that of FIG. 47, and steps 513-514 of FIG. 48 are replaced by writing N′ into TNO(AD′).

[0352] It is noted that the loops of FIGS. 43-47 may cover not only a BETWEEN range in a thesaurus, but generally words and/or macrowords whose coding data are stored in the same data container, and which are combined in an OR type of operation. Instead of running the loops from x=x1 to x=x2, an iteration is made for each one of such words or macrowords.

[0353] For example, if the word and macroword thesauruses for a given attribute share the same data container, the loop may be executed only once for all relevant values of the attribute, i.e. for the sub-tree which, in stage 191 of FIG. 36, has been substituted for the corresponding node of the query tree.

[0354] In addition, such words and/or macrowords may possibly belong to different thesauruses (which requires a suitable labeling of the OR nodes of the query tree). For example, if a query aims at the accidents undergone by a certain client or having a damage amount greater than a given value, and if the client and accident amount thesauruses share the same data containers (as in FIGS. 31-32), the client and accident amount attributes may be examined within the same first loop of FIG. 46, and the TNO table scanned only once to retrieve all the relevant HP1 segments.

[0355] However, it is preferable to have one data container for each thesaurus and each macroword level, as indicated previously. An advantage of this is to reduce the sizes of the rank tables TNO used in the procedure of FIG. 46.

[0356] It is also noted that, when encoding the leaves of the expanded query tree, it is possible to use the word indexes AT_P(q)_WI(x) instead of the thesaurus row-ID's x. A list of word indexes is then encoded for each leaf of the expanded query tree. Accordingly, the tree expansion procedure 191 is carried out with reference to the thesaurus word index files, whereas they are not used in the processing of stage 192, which directly calls the record chain head addresses by means of the word indexes. This is useful when the word indexes do not coincide with the thesaurus row-ID's (contrary to FIGS. 25-27), which will normally happen over the life of the database.

[0357] In the preferred case where separate coding data files are used for each thesaurus word, as in FIG. 17, the layer n processing of step 264 is similar to that shown in FIG. 44. The loop is not performed in a common data container (with the loop index AD), but in the individual coding data files AT_P(q)_W_NOk and AT_P(q)_W_HPk (with a loop index i as in FIG. 33). Optimal disc access is ensured without any thesaurus sorting. The layer k<n processing of step 265 does not need two loops as in FIG. 46. It may be in accordance with FIG. 49.

[0358] The first step 310 of the procedure shown in FIG. 49 consists in allocating the value AT_P(q)(x) of the word of rank x in the current thesaurus to the word variable W, and in initializing the loop index i to zero. As long as i is lower than the total number imax(AT, q, W, k) of layer k records in the coding data file relating to thesaurus AT, macroword level q and word W (test 311), steps 312-315 are performed. In step 312, the rank AT_P(q)_W_NOk(i) is assigned to the integer variable N. Those rank data are read block by block to minimize the disc accesses. In the following test 313, the bit Res(N) of the layer k+1 result bitmap Res is evaluated. If Res(N)=1, the bitmap segment AT_P(q)_W_HPk(i) is read in step 314 and superimposed, by a bitwise Boolean OR operation, onto the segment WZ[N] of bitmap vector WZ in step 315, whereby any “1” in AT_P(q)_W_HPk(i) is written at the corresponding position into WZ[N] and any “0” in AT_P(q)_W_HPk(i) leaves unchanged the corresponding bit of WZ[N]. The bitmap segment data AT_P(q)_W_HPk(i) are also read by blocks. In step 316, performed after step 315 or when Res(N)=0 in test 313, the loop index i is incremented by one unit before coming back to test 311. When the relevant coding data have been completely examined (i=imax(AT, q, W, k) in test 311), it is determined whether the current word x is the last one x2 of the BETWEEN range (test 317). If x<x2, the thesaurus pointer x is incremented by one unit in step 318 before coming back to step 310 for the next iteration of the loop. The loop is over when x=x2 in test 317.
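As a minimal sketch of the procedure of FIG. 49, assuming that the two coding data files of each word are simply held as a pair of Python lists (ranks and bitmap segments), that Res is an integer used as a bitmap and that WZ is a dictionary indexed by rank, the single loop may be written as below. The function name `or_word_files` and the sample data are hypothetical.

```python
def or_word_files(word_files, res, wz):
    """OR the layer k segments of every word of the range into wz.

    word_files -- one (ranks, segments) pair per word x = x1..x2, standing for
                  the files AT_P(q)_W_NOk and AT_P(q)_W_HPk of that word
    res        -- layer k+1 result bitmap held in an int (bit N = Res(N))
    wz         -- dict mapping a layer k rank N to the segment WZ[N]
    """
    for no_k, hp_k in word_files:             # steps 310 and 317-318
        for i, n in enumerate(no_k):          # loop 311-316 on index i
            if (res >> n) & 1:                # test 313
                # The segment is fetched only when its rank passes the test.
                wz[n] = wz.get(n, 0) | hp_k[i]    # steps 314-315
    return wz


# Two words; ranks 0 and 2 remain relevant after the layer k+1 processing.
word_files = [([0, 2], [0b0011, 0b1000]), ([0, 1], [0b0100, 0b0001])]
wz = or_word_files(word_files, res=0b101, wz={})
```

Because each word has its own files, both are read sequentially and no rank table is needed to regroup the read operations.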

[0359] FIG. 50 shows an alternative way of performing the leaf processing of FIG. 42 (when test 252 is positive), in the case where the coding format of the flat file row-ID lists is specified in the thesaurus index registers, as shown in the right part of FIGS. 25-27.

[0360] The initialization step 260A is similar to step 260 of FIG. 42, except that the rank table TNO is initialized to the default value at the same time. In step 280A, the word index WI=AT_P(q)_WI(x) and the corresponding format F=AT_P(q)_FORMAT(x) are read from the AT level q thesaurus index register. If F designates “low density” (test 261A), the loop 271-276 depicted in FIG. 43 is executed in step 262A. Otherwise (F designates “normal density” with n coding layers), the head address AD=AT_P(q)_Fk(WI) is read in step 281A to initiate the scanning of a record chain. If we are in the first coding layer k=n (test 263A), the loop 282-285 depicted in FIG. 44 is executed in step 264A. Otherwise, the first loop 282-285 of FIG. 46 is executed in step 265A. After step 262A, 264A or 265A, the current thesaurus pointer x is compared with the upper bound x2 of the BETWEEN range in test 286A, to be incremented in step 287A before coming back to step 280A if x<x2. When x=x2 in test 286A, the table TNO is exploited in step 301A, which is identical to the second loop 301-306 of FIG. 46, in order to complete the bitmap vector WZ returned in step 266 of FIG. 42.

Query Output

[0361] The SQL query further specifies how the data matching the query criteria should be presented in the response. Therefore, the next stage 193 of the query processing (FIG. 36) is the preparation of the results for their display in stage 194.

[0362] Typically, the query defines a list of attributes whose values should be included in the displayed response (“SELECT” and “FROM” clauses in the SQL query, with FROM specifying the relevant data tables and SELECT specifying the relevant columns in those tables).

[0363] When a link table of the type shown in FIG. 9 is stored, the columns of that link table corresponding to the listed attributes are read in the matching rows, identified in the bitmap vector Res output in step 246 of FIG. 41, in order to obtain the links pointing to the relevant data tables. The attribute values are then retrieved from the data tables for display.

[0364] Another possibility is to scan the thesaurus relating to such attribute and to compute the bitwise Boolean AND between the result bitmap vector Res and each encoded bitmap vector of the thesaurus. Every time there is a hit between those vectors (a “1” in the AND output vector), the corresponding thesaurus word will be displayed or otherwise processed. This permits the attribute values of the response to be retrieved without using any link or data table.

[0365] The AND operations may be performed directly in layer 1. They can also be performed as previously, by decrementing the layer index from k=n to k=1. This requires the layer k results which can be calculated from the layer 1 bitmap vector Res. The latter option optimizes the disc access by taking advantage of the multi-layer VDG compression scheme.

[0366] Such scanning may also be accelerated by taking advantage of the macroword thesauruses. The highest level thesaurus of an attribute is first scanned, and the portions of the lower level thesaurus(es) covered by a given macroword are scanned only if a hit has been observed for the macroword.
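The principle of the macroword-accelerated scan can be illustrated by the following simplified Python sketch. It deliberately ignores the multi-layer coding and assumes that every word and macroword carries a flat bitmap of row-ID's held in an integer, with a single macroword level above the word level. The function name `scan_with_macrowords`, the tuple layout and the sample words are hypothetical.

```python
def scan_with_macrowords(macrowords, res):
    """Return the (word, hit bitmap) pairs that intersect the result res.

    macrowords -- list of (prefix, bitmap, children), where children is the
                  list of (word, bitmap) entries covered by the macroword
    res        -- result bitmap of matching flat file row-ID's, held in an int
    """
    hits = []
    for _prefix, macro_bitmap, children in macrowords:
        if macro_bitmap & res == 0:
            continue                      # no hit: skip the covered words
        for word, bitmap in children:
            wx = bitmap & res             # bitwise Boolean AND with Res
            if wx:
                hits.append((word, wx))
    return hits


# Macroword "AB" covers rows {0, 1, 3}; macroword "CD" covers row {2}.
macrowords = [
    ("AB", 0b1011, [("AB1", 0b1001), ("AB2", 0b0010)]),
    ("CD", 0b0100, [("CD1", 0b0100)]),
]
hits = scan_with_macrowords(macrowords, res=0b1010)   # rows 1 and 3 match
```

Here the words covered by the macroword "CD" are never examined, since the macroword itself has no row in common with the result.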

[0367] FIG. 51 shows a procedure suitable for accessing the values to be included in the response for a given attribute AT by scanning the corresponding macroword and/or word thesauruses, which fully takes advantage of both the macroword grouping and the VDG compression scheme.

[0368] As before, it is assumed that the attribute AT has a number Q+1≧1 of thesauruses indexed by a level parameter q with 0≦q≦Q, having respective prefix lengths P(q) with P(0)>P(1)> . . . >P(Q), the level parameter q=0 designating the individual word thesaurus, whose prefix length corresponds to the attribute word length. In the notations of FIG. 45:

[0369] QA is an integer with 0≦QA≦Q representing a degree of accuracy expected in the query result; QA is set to 0 for maximum accuracy;

[0370] the thesaurus pointer x_(q) is a row-ID in the AT thesaurus index register of level q;

[0371] for q≧QA, WZ1_(q) is a bitmap vector which represents a layer q target list of data graph identifiers which match the query criteria and should be examined in connection with the level q thesaurus word x_(q). In the initialization step 320, the result bitmap vector Res, output in step 246 of FIG. 41, is assigned to the vector WZ1_(Q) which thus represents the flat file row-ID's matching the query criteria;

[0372] for k>1, WZk_(q) designates a bitmap vector in which each bit of rank N (i.e. the (N+1)-th bit) indicates whether the (N+1)-th segment of D(k−1) bits of WZ(k−1)_(q) includes at least one “1”, in accordance with the VDG compression scheme (0≦q≦Q). WZk_(q) is referred to as a layer k and level q filtering list for QA≦q≦Q and 1≦k≦n. Working zones are reserved in RAM 103 for containing the bitmap vectors WZk_(q) which need not be stored in the hard drive.

[0373] In the initialization step 320, the indexes q and x_(Q) are set to q=Q and x_(Q)=0, in order to start scanning the highest level thesaurus. In the conversion step 321, the bitmap vector WZ1_(q) is processed to provide the corresponding higher layer vectors WZk_(Q) (1<k≦n).
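The conversion of step 321 follows directly from the definition of the vectors WZk_(q) given above. A minimal Python sketch is given below for illustration, assuming that each bitmap vector is held in an integer and that the segment sizes are supplied in a dictionary D indexed by layer. The function names `next_layer` and `filtering_lists` are hypothetical.

```python
def next_layer(bitmap, d):
    """Return the vector whose bit N is 1 iff segment N (d bits) has a "1"."""
    out, n = 0, 0
    mask = (1 << d) - 1
    while bitmap:
        if bitmap & mask:
            out |= 1 << n
        bitmap >>= d
        n += 1
    return out


def filtering_lists(wz1, D, n_layers):
    """Build {k: WZk} for 1 <= k <= n from the layer 1 vector WZ1.

    D[k] -- number of bits of a layer k segment (D(k-1) is used to derive WZk)
    """
    wz = {1: wz1}
    for k in range(2, n_layers + 1):
        wz[k] = next_layer(wz[k - 1], D[k - 1])
    return wz


# Matching row-ID's {6, 70, 100}, with segments of 4 bits in layers 1 and 2.
wz = filtering_lists((1 << 6) | (1 << 70) | (1 << 100), {1: 4, 2: 4}, 3)
```

With these values, WZ2 has its bits of ranks 1, 17 and 25 set (the quotients of 6, 70 and 100 by 4), and WZ3 has its bits of ranks 0, 4 and 6 set.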

[0374] The coding layer index k is set to n in step 322, and a function FINTER is called (step 323) to determine the intersection between the integer list represented in the layer k coding data of the thesaurus entry x_(q) and the filtering list represented by the bitmap vector WZk_(q). The inputs to this function comprise the coding layer index k, the (macro)word thesaurus level q, the (macro)word index x=x_(q), and the bitmap vector WZ=WZk_(q). Its output is another bitmap vector having the same dimension, noted WX, which represents the integer list intersection.

[0375] The bitmap vector WX output by the function FINTER called in step 323 is tested in step 324 to determine whether at least one of its bits is “1”. If not, the (macro)word pointed to by x_(q) does not cover any attribute value relevant to the query, so that the thesaurus pointer x_(q) is incremented by one unit in step 325, and the program comes back to step 322 to examine the next (macro)word of the level q thesaurus.

[0376] If the bitmap vector WX has at least one “1” and if k>1 (following test 326), the layer index k is decremented by one unit in step 327. The next layer processing is then started from step 323.

[0377] When k=1 in test 326, WX≠0 is the bitmap representation of the list of flat file row-ID's which are represented both in the result bitmap vector Res and in the coding data of the current (macro)word x_(q).

[0378] If q>QA (test 330), this bitmap vector WX is saved as WZ1_(q−1) in step 331. The row-ID AT_P(q)_FW(x_(q)) of the first “child” of macroword x_(q) in the lower level thesaurus is then read in the level q thesaurus and assigned as a starting value of the thesaurus pointer x_(q−1) (step 332). The thesaurus level q is then decremented by one unit in step 333, and the lower level processing is started from step 321.

[0379] When q=QA in test 330, the word pointed to by x_(QA) (if QA=0), or a word covered by the macroword pointed to by x_(QA) (if QA>0), is an attribute value of a data graph matching the query criteria. In step 335, a certain action is taken based on this word or its thesaurus pointer x_(QA) and the corresponding bitmap vector WX. The latter vector identifies the rows of the flat file which contain the (macro)word x_(QA) in the AT column and which satisfy the query criteria. The type of action depends on the SQL query. Different possibilities will be described further on.

[0380] After step 335, the higher level bitmap vectors WZ1_(q) are updated to remove any “1” present at the same location as in WX. Such a “1” stands for a data graph having the word pointed to by x_(QA) (if QA=0), or a word covered by the macroword pointed to by x_(QA) (if QA>0), as the value of attribute AT; therefore, no other word will have a hit with it, so that it can be removed. To initialize the update, the index q is taken equal to Q in step 336. In step 337, the Boolean operation WZ1_(q) AND NOT WX is performed bit by bit, and the result becomes the updated WZ1_(q). If the resulting bitmap vector WZ1_(q) has at least one “1” remaining (test 338), the thesaurus level index q is decremented by one unit in step 339, and step 337 is repeated.

[0381] If WZ1_(q) consists only of zeroes in test 338, it is not necessary to continue the update in the lower levels. If q<Q (test 340), the (macro)word pointed to by x_(q) does not cover any further attribute value relevant to the query: the thesaurus pointer x_(q) is incremented in step 341, and the program comes back to step 321 to examine the next (macro)word of the level q thesaurus.

[0382] The scanning of the thesauruses for attribute AT is over when q=Q in test 340.

[0383] The function FINTER called in step 323 may be in accordance with the flow chart shown in FIG. 52 when the thesauruses are stored as shown in FIGS. 25-32. It is started in step 350 by loading the above-mentioned input arguments k, q, x (=x_(q)) and WZ (=WZk_(q)). In step 351, the bitmap vector WX is initialized with zeroes. The program first obtains the word index WI=AT_P(q)_WI(x) in step 352, and then the head address AD=AT_P(q)_Fk(WI) in step 353 to initiate the scanning of the relevant record chain in the data container.

[0384] If the level q thesaurus entry x for attribute AT is stored in the “low density” format (test 354), the processing is as described below with reference to FIG. 53 (step 355) to obtain the intersection vector WX. If the format is “normal density”, the processing depends on whether the program is in the first layer, that is k=n (test 356). The processing of FIG. 54 is applied if k=n (step 357), and that of FIG. 55 if k<n (step 358). After step 355, 357 or 358, the execution of function FINTER is terminated in step 359 by returning the bitmap vector WX.

[0385] The low density processing of FIG. 53 has a loop in which each iteration begins by comparing the address AD with the end-of-chain value (0) in test 360. If AD>0, there remains at least one item to be examined in the record chain, so that the flat file row-ID value NO(AD) and the next address NX(AD) are read as variables N and M, respectively, in step 361. The Euclidean division of N by Δk is made in step 362 to obtain the layer k−1 quotient (rank) N′. If WZ(N′)=1 in the following test 363, a “1” is written into bit WX(N′) of the bitmap vector WX (step 364). After step 364, or if WZ(N′)=0 in test 363, the variable M is substituted for AD in step 365 before coming back to test 360. The low density processing for the current (macro)word is over when the record chain has been completely examined (AD=0 in test 360), and the program proceeds to step 359 of FIG. 52.
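For illustration, the low density loop of FIG. 53 may be sketched in Python as follows, assuming that the columns NO and NX of the data container are lists addressed from 1, that WZ and WX are bitmaps held in integers, and that the divisor Δk is simply passed as a parameter `delta_k` (its value is not derived here). The function name `low_density_intersection` and the sample data are hypothetical.

```python
def low_density_intersection(head, NO, NX, wz, delta_k):
    """Intersect a low density record chain with the filtering list wz.

    head    -- head address of the record chain (0 = empty chain)
    NO, NX  -- row-ID and next address columns, addressed from 1
    wz      -- filtering bitmap held in an int (bit N' = WZ(N'))
    delta_k -- divisor applied to the row-ID's in step 362
    """
    wx = 0                          # WX initialized with zeroes
    ad = head
    while ad > 0:                   # test 360
        n, m = NO[ad], NX[ad]       # step 361
        n_low = n // delta_k        # step 362: Euclidean quotient
        if (wz >> n_low) & 1:       # test 363
            wx |= 1 << n_low        # step 364
        ad = m                      # step 365
    return wx


# Chain holding the row-ID's 5, 22 and 70; the quotients by 4 are 1, 5, 17.
NO = [None, 5, 22, 70]
NX = [0, 2, 3, 0]
wx = low_density_intersection(1, NO, NX, wz=(1 << 1) | (1 << 17), delta_k=4)
```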

[0386] The layer n normal density processing of FIG. 54 has a similar loop in which each iteration begins, in step 370, by comparing the address AD with the end-of-chain value (0). If AD>0, the layer n rank value NOn(AD) and the next address NXn(AD) are read as variables N and M, respectively, in step 371. If the segment of rank N in the bitmap vector WZ has at least one “1” (WZ[N]≠0 in the following test 372), the bitmap segment HPn(AD) is read (step 373) and combined with the bitmap segment WZ[N] in a bitwise Boolean AND operation to provide the segment WX[N] of the bitmap vector WX (step 374). After step 374, or if WZ[N]=0 in test 372, the variable M is substituted for AD in step 375 before coming back to test 370. The layer n normal density processing for the current (macro)word is over when the record chain has been completely examined (AD=0 in test 370), and the program proceeds to step 359 of FIG. 52.
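The layer n loop of FIG. 54 differs from the low density loop in that it combines whole segments. A minimal Python sketch is given below, assuming that WZ and WX are dictionaries mapping a rank to a segment held in an integer (an absent rank standing for an all-zero segment) and that the container columns are lists addressed from 1. The function name `layer_n_intersection` and the sample data are hypothetical.

```python
def layer_n_intersection(head, NO, NX, HP, wz):
    """AND the layer n segments of one record chain with the segments of wz."""
    wx = {}                         # absent ranks stand for zero segments
    ad = head
    while ad > 0:                   # test 370
        n, m = NO[ad], NX[ad]       # step 371
        seg = wz.get(n, 0)
        if seg != 0:                # test 372
            wx[n] = HP[ad] & seg    # steps 373-374: read HPn(AD), then AND
        ad = m                      # step 375
    return wx


# Chain 1 -> 2 -> 3 holding the ranks 0, 1 and 3.
NO = [None, 0, 1, 3]
NX = [0, 2, 3, 0]
HP = [0, 0b0011, 0b0001, 0b1100]
wx = layer_n_intersection(1, NO, NX, HP, wz={0: 0b0110, 3: 0b0011})
```

In this example the segment of rank 1 is never read because WZ[1] is zero, whereas the segment of rank 3 is read but yields an empty intersection.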

[0387] The layer k<n normal density processing is advantageously made of two successive loops (FIG. 55). The first loop is for determining a temporary rank table TNO, which is used to handle the bitmap segments in the second loop, as in the procedure described previously with reference to FIG. 46. Table TNO has a number of addresses which is at least equal to the number of addresses ADmax of the data container in which the layer k coding data of the current thesaurus (AT, q) are stored. Each entry TNO(AD) of address AD in the rank table TNO is for containing an integer representing the rank NOk(AD) if it is useful to access the bitmap segment HPk(AD), or else a default value (−1). Such access is useless if NOk(AD) does not belong to the layer k rank list associated with the current (macro)word x_(q), or if there are only zeroes in the segment of rank NOk(AD) in the bitmap vector WZ=WZk_(q+1).

[0388] In the initialization step 380, all entries of the rank table TNO are set to the default value −1. Each iteration of the first loop begins in step 381 by comparing the address AD with the end-of-chain value (0). If AD>0, the layer k rank value NOk(AD) and the next address NXk(AD) are read as variables N and M, respectively, in step 382. The segment WZ[N] of rank N in the bitmap vector WZ is examined in test 383. If that segment WZ[N] has at least one “1” (WZ[N]≠0 in test 383), the rank N is written at address AD into table TNO in step 384 before substituting M for AD in step 385 and coming back to test 381 to examine the next record of the chain. Otherwise (WZ[N]=0), the rank N is filtered out by jumping directly to step 385.

[0389] The first loop is over when the record chain has been completely examined (AD=0 in test 381). The program then proceeds to the second loop 386-391. In each iteration of the second loop, the contents N of the rank table TNO at address AD, read in step 387 after having incremented AD in step 386, are compared with the default value in test 388. If N is a valid rank value (≠−1), the bitmap segment HPk(AD) is read (step 389) and combined with the bitmap segment WZ[N] in a bitwise Boolean AND operation to provide the segment WX[N] of rank N in the bitmap vector WX (step 390). If AD<ADmax (test 391), the rank table address AD is incremented by one unit in step 386 when starting the next iteration. The second loop is over when AD=ADmax in test 391, and the program proceeds to step 359 of FIG. 52.

[0390] The scanning of the thesauruses as explained with reference to FIGS. 51-55 has a number of significant advantages:

[0391] it does not require any access to the original data tables. Therefore it is not compulsory to maintain the data tables in memory. Even when they are stored, they will often be accessible through a relatively slow software interface, such as ODBC. The scanning method advantageously circumvents that interface;

[0392] it is very efficient in terms of disc accesses, because it takes advantage of the record grouping in the coding data container. The procedures of FIGS. 53-55 are respectively similar to those of FIGS. 43, 44 and 46 regarding the disc accesses, and they provide the above-described advantages in this respect;

[0393] the procedure of FIG. 51 is also very efficient owing to the filtering achieved by the updating of the bitmap vectors WZ1_(q) (loop 336-339). This filtering takes advantage of the fact that each flat file row has a unique value (possibly Null) for each attribute. It avoids plenty of useless operations to read coding data pertaining to subsequent thesaurus words and macrowords which would not provide hits in the lowest layer (because the hit in the higher layer would be due to a flat file row-ID corresponding to an already considered thesaurus word).

[0394] FIG. 56 shows how the procedure of FIG. 51 can be adapted when the coding data containers are stored as illustrated in FIGS. 25-30 and 34A-B. The above-described function FINTER is replaced by a recursive function FFILT illustrated by FIG. 57. Accordingly, the loop 322-327 is replaced by a loop 590-593 after executing steps 351-353 as in FIG. 52 (with k=n). If the resulting intersection bitmap WX is made of zeroes only (test 324), x_(q) is incremented in step 325 before coming back to step 351 for the next (macro)word of the current level q thesaurus range. If WX has at least one “1” in test 324, the program proceeds to step 330 as described before. In all other respects, the procedure of FIG. 56 is the same as that of FIG. 51.

[0395] Each iteration in the loop 590-593 begins by comparing the address AD with the end-of-chain value (0) in test 590. If AD>0, the layer n rank value NOn(AD) and the next address NXn(AD) are read as variables N and M, respectively, in step 591. Afterwards, the filtering and intersection function FFILT is called in step 592 before substituting M for AD in step 593. The computation of the intersection list WX for the current (macro)word is over when the layer n record chain has been completely examined (AD=0 in test 590), and the program proceeds to test 324 as indicated hereabove.

[0396] A flow chart of this function FFILT is shown in FIG. 57. Its arguments, input when starting its execution in step 600, are as follows:

[0397] a coding layer index k, with k=n when the function FFILT is called in step 592 of FIG. 56;

[0398] k bitmap vectors WZ1_(q), WZ2_(q), . . . , WZk_(q) as obtained in step 321 of FIG. 56;

[0399] a layer k rank N, with N=NOn(AD) when the function FFILT is called in step 592 of FIG. 56;

[0400] the corresponding record address AD in the layer k data container; and

[0401] the intersection bitmap vector WX which is being calculated.

[0402] In test 601, it is determined whether the segment of rank N of the bitmap vector WZk_(q) is only made of zeroes. If so, it is not necessary to read any further coding data relating to the layer k rank N, so that the execution of the function is terminated in step 602 by returning the bitmap vector WX.

[0403] If the segment WZk_(q)[N] has at least one “1” in test 601, the bitmap segment HPk(AD) is read as segment variable H in step 603, and the intersection segment H AND WZk_(q)[N] is evaluated in test 604. If this intersection segment is only made of zeroes, it is also useless to read any further coding data, and the program directly proceeds to step 602.

[0404] If test 604 reveals that H AND WZk_(q)[N] has at least one “1”, it is necessary to get into the lower layer record chain. Its head address F(k−1)(AD) is read as variable AD′ in step 605, while the layer k remainder R is initialized to 0 and the layer k−1 rank N′ is initialized to N×Dk. The bitmap segment H=HPk(AD) is scanned in a loop in which its bits H(R) are successively examined (test 606) to ascertain whether the rank N′=N×Dk+R should be considered. If H(R)=0, the rank N′ is not in the layer k coding data of the current thesaurus entry, so that it is disregarded: R is incremented by one unit in step 607 and if the new R is still smaller than Dk (test 608), N′ is also incremented by one unit in step 609 before proceeding to the next iteration from test 606.

[0405] If H(R)=1 in test 606, the bit of rank N′ of the vector WZk_(q) is examined in test 610 to determine whether the layer k−1 rank N′ is in the result list. If not (WZk_(q)(N′)=0), the program jumps to the next position in the layer k−1 record chain by replacing AD′ by the next address NX(k−1)(AD′) in step 611. After step 611, the program proceeds to the above-described step 607.

[0406] If WZk_(q)(N′)=1 in test 610, the processing depends on whether the coding layer k is immediately above 1 (test 612). If k=2, the bitmap segment HP1(AD′) is read (step 613) and combined with the bitmap segment WZ1_(q)[N′] in a bitwise Boolean AND operation to provide the segment WX[N′] of rank N′ in the bitmap vector WX (step 614). If k>2 in test 612, the recursive function FFILT is called in step 615 with the arguments k−1, WZ1_(q), . . . , WZ(k−1)_(q), N′, AD′ and WX. After step 614 or 615, the program proceeds to the above-described step 611.

[0407] The scanning of the bitmap segment H=HPk(AD) is over when R=Dk in test 608. The updated bitmap vector WX is then returned in step 602.

[0408] It is noted that the use of a layer 1 rank table TNO (as in FIG. 55) is quite compatible with the procedure of FIGS. 56-57. The records of the table TNO are initialized with the default value in step 351; steps 613-614 of FIG. 57 are replaced by writing N′ into TNO(AD′); and when AD=0 in test 590, table TNO is scanned as in loop 386-391 of FIG. 55.

[0409] A further optimization of the procedure of FIG. 51 or 56 can be achieved when the stored thesaurus data include files organized as illustrated in FIGS. 58-61. For each thesaurus, a table of the type shown in FIGS. 58-60 is stored, to associate each possible value of the layer n rank NOn with a record chain head address F_AD′ in an additional data container as shown in FIG. 61. The latter data container contains the same layer n bitmap segment data HP′2=HP2 as that of FIG. 32 or 34A, but the links NX′2 define record chains which pertain to the same layer n rank rather than to the same thesaurus entry. The data container of FIG. 61 is thus obtained by sorting that of FIG. 32 or 34A based on the NO2 column, deleting the NO2, NX2 and F1 columns, and adding a column NX′2 to contain the next addresses in the record chains based on NO2 and a further column PTR where the thesaurus index x to which the record pertains is written. For each rank NO2 the head address of the chain is memorized in F_AD′(NO2).

[0410] Before starting the procedure of FIG. 51 or 56, or after every iteration of step 321, the pre-filtering treatment shown in FIG. 62 is applied to identify the thesaurus entries that need not be read because their layer n ranks are not in the layer n coding data of the matching data graph identifier list. The marking is done by means of a table T_(q) for a macroword level q, which has one bit T_(q)(x_(q)) for each level q thesaurus pointer x_(q). Those bits, as well as the layer n rank N=NOn, are initialized to zero in step 620 of FIG. 62. If the segment of rank N of WZn_(q) is only made of zeroes (test 621), test 622 is performed to determine whether the highest possible layer n rank NOn_(max) has been reached. If not, N is incremented in step 623 and test 621 is repeated. When WZn_(q)[N]≠0 in test 621, the head address F_AD′(N) is read as variable AD′ in step 624 and compared to the end-of-chain value (0) in test 625. If AD′=0, the program proceeds to test 622. Otherwise, the bitmap segment HP′n(AD′) and the corresponding next address value NX′n(AD′) are read as variables H′ and M′, respectively, in step 626. If H′ and WZn_(q)[N] have no “1” in common (test 627), M′ is substituted for AD′ in step 628, and the next iteration is started from test 625. If there is at least one “1” in the bitwise Boolean AND combination of H′ and WZn_(q)[N] in test 627, the thesaurus pointer x_(q)=PTR(AD′) is read in the last column of FIG. 61, and a “1” is written in the corresponding location of table T_(q) before proceeding to step 628.

[0411] After that, as shown in FIG. 63, the loop 322-327 of FIG. 51, where a relevant bitmap vector WX is calculated, is completed by an initial filtering step 640 where the bit T_(q)(x_(q)) is tested. This test 640 is also performed after having incremented x_(q) in step 325. If T_(q)(x_(q))=1 in test 640, the program proceeds to step 322 as described before. If T_(q)(x_(q))=0, it proceeds directly to step 325, thereby avoiding the computation of an intersection list WX that will be empty.

[0412] The same filtering step 640 can be performed before step 351 in FIG. 56.

[0413] The function FINTER illustrated in FIGS. 52-55 is readily adapted to the case where separate coding data files are used for each thesaurus word, as in FIG. 17. Steps 352-353 of FIG. 52 are replaced by the allocation of the value AT_P(q)(x) to the word variable W, and by the initialization of the loop index i to zero. The low density processing of step 355 and the layer n normal density processing of step 357 are similar to those shown in FIGS. 53 and 54. The loop is not performed in a common data container (with the loop index AD), but in the individual coding data files (with a loop index i as in FIG. 33). The layer k<n processing of step 358 does not need two loops as in FIG. 55. It may be in accordance with FIG. 64.

[0414] In the procedure shown in FIG. 64, steps 395-399 are performed as long as the loop index i is lower than the total number imax(AT, q, W, k) of layer k records in the coding data file relating to thesaurus AT, macroword level q and word W (test 394). In step 395, the rank AT_P(q)_W_NOk(i) is assigned to the integer variable N. In the following step 396, the segment WZ[N] of rank N in the bitmap vector WZ is tested. If WZ[N] has at least one “1” (WZ[N]≠0), the bitmap segment AT_P(q)_W_HPk(i) is read (step 397) and combined with the bitmap segment WZ[N] in a bitwise Boolean AND operation to provide the segment WX[N] of rank N in the bitmap vector WX (step 398). In step 399, performed after step 398 or when WZ[N]=0 in test 396, the loop index i is incremented by one unit before coming back to test 394. The loop is over when the relevant coding data have been completely examined, i.e. when i=imax(AT, q, W, k) in test 394.
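The single loop of FIG. 64 may be pictured with the short sketch below. It is an illustration under stated assumptions: the function name `layer_k_intersect`, the two parallel Python lists standing in for the NOk and HPk files of one thesaurus word, and the sample bit patterns are all hypothetical.

```python
def layer_k_intersect(ranks, segments, wz, wx):
    """Fill wx[N] with the AND of each stored segment and wz[N] (hypothetical helper).

    ranks[i] and segments[i] play the role of the layer k rank and bitmap
    segment of record i in the coding data file of one thesaurus word.
    """
    i = 0
    while i < len(ranks):              # test 394: i < imax
        n = ranks[i]                   # step 395: read the rank
        if wz[n] != 0:                 # test 396: skip segments already empty
            h = segments[i]            # step 397: read the bitmap segment
            wx[n] = h & wz[n]          # step 398: bitwise Boolean AND
        i += 1                         # step 399
    return wx


# Hypothetical coding data for one word: three records of 4-bit segments.
ranks = [0, 2, 3]
segments = [0b1100, 0b0011, 0b1111]
wz = [0b0100, 0b1111, 0b0000, 0b1010]  # layer k bitmap vector of the matching list
wx = layer_k_intersect(ranks, segments, wz, [0, 0, 0, 0])
print([bin(s) for s in wx])
```

The record of rank 2 is never read from the segment file because WZ[2] is zero, which is the memory access saving the paragraph describes.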

[0415] The above-described procedure may involve different types of action in step 335 of FIG. 51 or 56, based on features of the SQL query.

[0416] In a relatively simple type of SQL query, a list of values of one attribute is required (e.g. name all clients who meet certain criteria). In such a case, the scanning of FIG. 51 or 56 is performed only in the thesaurus(es) relating to that attribute, with QA=0, and the action of step 335 may simply be to read the word AT(x₀) which is in position x₀ in the individual word thesaurus (in fact, if the coding data are stored as illustrated in FIG. 17, the word AT(x₀) has been read just before) and to write this word AT(x₀) into an output table, or print it out. It is observed that the word list thereby produced is automatically sorted, in the ascending order. If the reverse order is required, the thesaurus may be scanned in the opposite direction.

[0417] If the SQL query has a DISTINCT keyword in the SELECT clause regarding the attribute AT, there is one output of the word AT(x₀) in step 335. If not, or if the SELECT clause has the keyword ALL, there may be one output of the word AT(x₀) for each non-zero bit of WX in step 335. Those non-zero bits may also be counted to provide the number of occurrences of the word AT(x₀) in the matching data graphs.

[0418] If the values of the attribute are required with a reduced accuracy, the thesaurus may be scanned as shown in FIG. 51 or 56 with QA>0, thereby avoiding memory accesses to obtain irrelevant details from the level q thesauruses with q<QA. For example, if a date attribute is required expressed in years, the scanning of FIG. 51 or 56 may be stopped at the level QA corresponding to a truncation length of 4.

[0419] The SQL query frequently requires several attributes in the SELECT and FROM clauses. In order to maintain the connections between the attribute values belonging to the same data graph, some form of indexing is needed. A possibility is to reserve in RAM 103 a working zone for containing an output table having as many rows as in the virtual flat file and respective columns for receiving the attribute values of the result lists. The memory locations of the output table are initialized with a default value. The above-mentioned attribute values AT(x₀), or their prefixes AT(x_(QA)) if QA>0, are written into the output table in the occurrences of step 335 shown in FIG. 51 or 56. Such a write operation in step 335 is made into any row of the output table indicated by a non-zero bit of the bitmap vector WX. The output data are eventually produced by eliminating the empty rows from the output table (the rows that still contain the default value).

[0420] FIG. 65 shows how step 335 is developed in such a case, to write the word W=AT(x_(QA)) where appropriate in the column OT_AT of the output table. The row pointer j is initialized to zero in step 400, and the word W is loaded (if it has not been before). Every time the bit WX(j) is 1 (test 401), the word W is written into row j and column AT of the output table (step 402). The row pointer j is then compared to its maximum value jmax in test 403 and incremented if j<jmax (step 404). The program has finished the action of step 335 when j=jmax in test 403.
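The write loop of FIG. 65 can be pictured with the following sketch. It assumes, for illustration only, that the output table is a list of Python dictionaries, that `None` plays the role of the default value, and that WX is given as a list of bits; the function name `write_word` and the sample words are hypothetical.

```python
DEFAULT = None  # assumed default value marking an empty location


def write_word(output_table, column, word, wx):
    """Write `word` into `column` of every row j whose bit WX(j) is 1."""
    for j in range(len(wx)):                  # steps 400, 403 and 404
        if wx[j] == 1:                        # test 401
            output_table[j][column] = word    # step 402


# Hypothetical virtual flat file of six rows and one requested attribute.
table = [{"OT_AT": DEFAULT} for _ in range(6)]
write_word(table, "OT_AT", "JONES", [0, 0, 1, 0, 0, 0])
write_word(table, "OT_AT", "SMITH", [1, 0, 0, 1, 0, 0])

# The output data are the rows that no longer hold the default value.
result = [row for row in table if row["OT_AT"] is not DEFAULT]
print(result)
```

Each call corresponds to one occurrence of step 335, i.e. to one thesaurus entry whose bitmap vector WX is not empty.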

EXAMPLE 4

[0421] We consider the query criteria of Example 3 and assume that the attributes requested for display are accident date, client name and policy date. In Example 3, discussed with reference to FIGS. 37-38, the bitmap of the matching data graphs (output in step 246 of FIG. 41) is Res=101100001000, as may be checked in FIG. 8. In this example, FIG. 66 shows the contents of the output table as described hereabove.

[0422] The above-mentioned output table may be too big to be conveniently reserved in RAM 103. In real databases, the number of rows in the virtual flat file is relatively high (e.g. millions) and if there are too many characters in one row of the output table (because there are too many attributes to be included or because some of them use a relatively high number of characters), the output table may become prohibitively big. There are several solutions to deal with this potential problem.

[0423] One of them is to write the thesaurus row-ID's x_(QA) (integers) into the output table instead of the (macro)words AT(x_(QA)) in step 402 of FIG. 65. Once all the relevant thesauruses have been scanned, the non-empty output table rows are selected to retrieve the attribute values from the thesaurus row-ID's. This reduces the breadth of the columns of the output table since the words AT(x_(QA)) often require many more characters.

[0424] FIG. 67 shows the contents of such an output table in the case of Example 4, the thesauruses being sorted as in FIGS. 10A-G.

[0425] Another solution, alternative or cumulative, is to use an index in RAM 103, to associate an integer address with each data graph or flat file row-ID. A default address is initially assigned to all the data graphs. When one of them is designated for the first time by a “1” in the corresponding bit of WX in step 335 (i.e. when scanning the first thesaurus), it is allocated a new address obtained by incrementing a counter. This address is retrieved from the index when the data graph is again designated in the scanning of the subsequent thesaurus(es). This integer address is a row-ID in an output table stored in RAM 103, which has a reduced number of rows where the attribute values or prefixes AT(x_(QA)), or their thesaurus row-ID's x_(QA), are written. The non-empty rows are consecutive and hence the total number of rows can be significantly reduced. This compressed output table is eventually read out to display the results.

[0426] FIG. 68 shows the contents of such index and output table, containing thesaurus row-ID's, in the case of Example 4.

[0427] FIGS. 69 and 70 show how step 335 is developed when scanning the first thesaurus and the subsequent thesaurus(es), respectively. The steps 400, 401, 403, 404 indicated by the same reference numerals are identical to those of FIG. 65. In FIG. 69, when the bit WX(j) is 1, the counter value m (initialized to 0 in step 320 of FIG. 51 or 56) is allocated to the index IND(j) for row j (step 410), the thesaurus pointer x_(QA) (or word W=AT(x_(QA))) is written into row j and column AT of the output table (step 411), and the counter value m is incremented (step 412). When the scanning of the first thesaurus is over, m represents the number of matching data graphs. In FIG. 70, when the bit WX(j) is 1, the index IND(j) for row j is retrieved as pointer m′ (step 413) and the thesaurus pointer x_(QA) (or word W) is written into row m′ and column AT of the output table (step 414).
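A minimal sketch of the indexed, compressed output table described above follows. Several choices are assumptions made for the illustration and are not taken from the figures: the compressed table is a Python list that grows by one row per allocated address, `None` stands for the default address, WX is a list of bits, and the function and column names are hypothetical.

```python
def scan_first_thesaurus(wx, x, ind, table, column, m):
    """FIG. 69 style action: allocate a new compressed row to each designated row j."""
    for j in range(len(wx)):
        if wx[j] == 1:
            ind[j] = m                    # step 410: remember the address of row j
            table.append({column: x})     # step 411: write the thesaurus pointer
            m += 1                        # step 412: next free address
    return m


def scan_next_thesaurus(wx, x, ind, table, column):
    """FIG. 70 style action: reuse the address stored in the index."""
    for j in range(len(wx)):
        if wx[j] == 1:
            m_prime = ind[j]              # step 413: retrieve the address
            table[m_prime][column] = x    # step 414: write into the compressed row


# Hypothetical flat file of six rows and two requested attributes A and B.
ind = [None] * 6          # default address for every data graph
table = []                # compressed output table
m = 0
m = scan_first_thesaurus([1, 0, 0, 1, 0, 0], 0, ind, table, "A", m)
m = scan_first_thesaurus([0, 0, 1, 0, 0, 0], 1, ind, table, "A", m)
scan_next_thesaurus([1, 0, 1, 0, 0, 0], 5, ind, table, "B")
scan_next_thesaurus([0, 0, 0, 1, 0, 0], 7, ind, table, "B")
print(m, ind, table)
```

After the first thesaurus has been scanned, `m` equals the number of matching data graphs (3 here), and the compressed table has exactly that many rows instead of six.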

[0428] The output table is easily sorted based on the contents of its columns when the SQL query has GROUP BY, ORDER BY or similar clauses. Such sorting operation may be performed hierarchically with reference to a plurality of attributes. The most significant attribute in the hierarchy is preferably subjected to the first thesaurus scanning as shown in FIG. 51 or 56 so that the first sorting criterion will be automatically fulfilled when constructing the output table. The sorting based on the remaining attributes is done within each portion of the output table that has common values for the previous attribute(s).

[0429] The sorting is particularly simple when the columns of the output table contain thesaurus row-ID's x_(QA), as in FIG. 68, because it only involves sorting integer lists.

[0430] It has been indicated before that for certain attributes, in particular numerical fields, the explicit attribute values may be stored in the link table (if there is a link table). The output table of the type illustrated in FIG. 66, 67 or 68 need not have a column for such attribute. If the attribute is to be displayed or otherwise exploited, its values can be retrieved from the link table in the rows corresponding to (i.e. having the same row-ID as) the non-empty rows of the output table (FIGS. 66-67) or the valid pointers in the output table index (FIG. 68).

[0431] SQL queries may also require calculations to be made on attribute values of the matching data records, particularly in data warehousing applications. Such calculations can be performed from the data of an output table of the type illustrated in FIG. 66, 67 or 68.

EXAMPLE 5

[0432] From Example 4, we assume that the (arithmetic) mean value of the time difference between the accident date and the policy date is requested, expressed as a number of days. For each non-empty row of the output table, the program computes the difference, in number of days, between the first and third column. Those differences are accumulated and the result is divided by the number of non-empty rows (4) to obtain the desired mean value.

[0433] In fact, this mean value can be computed with an output table reduced to only one memory location: when scanning the accident date thesaurus, the attribute value expressed as a number of days from an arbitrary reference day is multiplied by the number of non-zero bits in WX in step 335 of FIG. 51 or 56 and added to an accumulation variable V (initialized to 0 in step 320) stored in the memory location of the reduced output table; then, when scanning the policy date thesaurus, the attribute value expressed as a number of days from the same reference day is multiplied by the number of non-zero bits in WX and subtracted from V in step 335; finally, the resulting V is divided by the number of non-zero bits in the result bitmap Res to provide the desired mean value.
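The single-accumulator computation can be checked on a small case with the sketch below. The data are hypothetical and are not those of Example 5: each thesaurus is modelled as a list of (day count, WX bitmap) pairs, with bitmaps held as Python integers, and `popcount` and `mean_difference` are illustrative helper names.

```python
def popcount(bitmap):
    """Number of non-zero bits in a bitmap held as a Python integer."""
    return bin(bitmap).count("1")


def mean_difference(accident_entries, policy_entries, res):
    """Mean of (accident day - policy day) over the rows set in `res`.

    Each entry is (value in days from a common reference day, WX bitmap),
    where WX is the entry's row list already intersected with `res`.
    """
    v = 0                                  # accumulation variable V
    for days, wx in accident_entries:
        v += days * popcount(wx)           # add once per matching row
    for days, wx in policy_entries:
        v -= days * popcount(wx)           # subtract once per matching row
    return v / popcount(res)


# Hypothetical data: three matching rows (bits 0, 1 and 2).
res = 0b0111
accident = [(100, 0b0011), (200, 0b0100)]
policy = [(50, 0b0101), (80, 0b0010)]
mean_days = mean_difference(accident, policy, res)
print(mean_days)
```

Row by row, the differences are 50, 20 and 150 days, so the accumulated V is 220 and the mean is 220/3, the same value a full output table would give. This shortcut relies on the mean being linear in the attribute values.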

[0434] However, an output or computation table having more than one memory location is often useful in RAM 103 for that sort of calculations, in particular in cases where the desired quantity is not linear with respect to the attribute values (e.g. if the quadratic or geometric, rather than arithmetic, mean value is requested in Example 5).

[0435] A computation table is a particular case of output table, and it has a structure similar to that of the output table described hereabove. It may have as many rows as in the virtual flat file (as the output tables of FIGS. 66-67). Alternatively, it may be associated with an index identical to that of FIG. 68. It may also have only one row, as in the above example of the output table having one memory location. Each column of the computation table is for containing values of an operand used in the calculation to be made. Depending on the complexity of the calculation, one or more columns may be needed, but in most cases one column will be sufficient.

[0436] The attributes whose values are involved in the calculation have their thesauruses scanned successively, as described with reference to FIG. 51 or 56. Step 335 may be developed as shown in FIG. 71 in the case of a computation table CT having a single column and as many rows as in the virtual flat file (when there is an index, it can be handled as in FIGS. 69-70). In FIG. 71, steps 400, 401, 403 and 404 are identical to those of FIG. 65. When the bit WX(j) is 1 in step 401, the contents CT(j) of the computation table in row j is allocated to the operand Y in step 416, and then a function f of the operand Y and of the current (macro)word W=AT(x_(QA)) is calculated and saved as the new contents CT(j) in step 417.

[0437] The mathematical function f is selected on the basis of the calculation to be performed and of the thesaurus being scanned. Referring again to Example 5, when the accident date is first scanned, the function f(Y,W) may be the transformation of the date W expressed in the format yyyy mm dd into a number of days from a reference day (it is thus a function of W only); when the policy date thesaurus is scanned, the function f(Y,W) may consist in applying the same transformation to the date W and subtracting the result from Y. Afterwards, the mean value (arithmetic, quadratic, geometric, . . . ) of the non-empty rows of CT is calculated to provide the desired output result. Other kinds of global calculation can be performed from the columns of the computation table, for example statistical, financial or actuarial calculations.
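The use of a single-column computation table with a per-thesaurus function f may be sketched as follows. The sketch rests on assumptions: dates are strings in the yyyy mm dd format, the reference day is the one implied by Python's `date.toordinal()`, `None` marks an empty row, WX is a list of bits, and all names and dates are hypothetical.

```python
import math
from datetime import datetime


def to_days(word):
    """Convert a 'yyyy mm dd' word into a day count from a fixed reference day."""
    return datetime.strptime(word, "%Y %m %d").date().toordinal()


def apply_f(ct, wx, word, f):
    """FIG. 71 style action: CT(j) <- f(CT(j), W) for every row j with WX(j) = 1."""
    for j in range(len(wx)):
        if wx[j] == 1:
            y = ct[j]              # step 416: read the operand Y
            ct[j] = f(y, word)     # step 417: save the new contents


ct = [None] * 4  # computation table, one column, as many rows as the flat file

# First scan (accident date): f depends on W only.
for word, wx in [("2000 01 11", [1, 0, 1, 0]), ("2000 03 01", [0, 1, 0, 0])]:
    apply_f(ct, wx, word, lambda y, w: to_days(w))

# Second scan (policy date): subtract the transformed date from Y.
for word, wx in [("1999 12 31", [0, 0, 1, 0]), ("2000 01 01", [1, 1, 0, 0])]:
    apply_f(ct, wx, word, lambda y, w: y - to_days(w))

values = [v for v in ct if v is not None]          # non-empty rows of CT
arithmetic_mean = sum(values) / len(values)
quadratic_mean = math.sqrt(sum(v * v for v in values) / len(values))
print(ct, arithmetic_mean, quadratic_mean)
```

Because the table keeps one difference per row, a non-linear quantity such as the quadratic mean can be derived at the end, which a single accumulator could not provide.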

[0438] The macrowords are advantageously used in this type of calculation if the desired accuracy is lower than that afforded by the individual words of at least one of the attributes involved.

Virtual Flat File Partitioning

[0439] For large systems, it is often advantageous to partition the virtual flat file into several portions or blocks each consisting of a determined number of rows. The data graphs are distributed into the blocks based on their identifiers (flat file row-ID's).

[0440] Preferably, each thesaurus is divided into corresponding thesaurus sections, whereby each section has entries whose flat file row-ID lists are included in the corresponding virtual flat file block. The complete flat file row-ID list associated with one word assigned to an attribute is the union of the lists represented in the entries of the corresponding thesaurus sections for that word. Accordingly the complete flat file row-ID lists of the thesaurus entries are subjected to the same partitioning as the virtual flat file: they are split into sub-lists corresponding to the thesaurus sections.

[0441] The thesaurus index file for an attribute may be common to all the sections. A separate index file may also be provided for each section.

[0442] For each one of the blocks, steps 191-193 of the processing of a SQL query (FIG. 36) are performed as described hereabove with reference to FIGS. 38-71. The results thus obtained are merged to display the response.
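The partitioning of a row-ID list into per-block sub-lists, and the merging of per-block results, can be illustrated with the sketch below. It assumes, purely for the example, that every block covers jmax consecutive row-ID's and that rows are renumbered from 0 inside each block; the text only states that data graphs are distributed into blocks based on their identifiers. The function names are hypothetical.

```python
def partition_row_ids(row_ids, jmax):
    """Split a flat file row-ID list into sub-lists, one per block of jmax rows."""
    blocks = {}
    for r in row_ids:
        # Assumed layout: block number r // jmax, local row r % jmax.
        blocks.setdefault(r // jmax, []).append(r % jmax)
    return blocks


def merge_block_results(block_results, jmax):
    """Rebuild global row-ID's from per-block local results."""
    merged = []
    for block in sorted(block_results):
        merged.extend(block * jmax + local for local in block_results[block])
    return merged


# Hypothetical row-ID list of one thesaurus word, with blocks of four rows.
sub_lists = partition_row_ids([0, 3, 4, 9, 10], jmax=4)
print(sub_lists)
print(merge_block_results(sub_lists, jmax=4))
```

The union of the sub-lists restores the complete row-ID list of the word, which is the property the thesaurus sections are stated to have.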

[0443] The processing of the query with respect to the different blocks may be performed sequentially or in parallel.

[0444] In a sequential processing, RAM availability for optimal processing speed can be effectively controlled. Even though the cost of RAM circuits is not currently considered to be critical, a given machine has a certain amount of available RAM capacity, and this limits the RAM space that can be reserved for the above-described output or computation tables. When the limitation is likely to be encountered, partitioning the virtual flat file directly reduces the size of those tables (jmax in FIGS. 65 and 69-71).

[0445] Accordingly, the use of a particular machine to carry out the invention will dictate the choice of jmax, that is the block size. The virtual flat file blocks are dimensioned based on the selected size parameter, and the corresponding thesaurus sections are constructed one section after the other as indicated with reference to steps 122-126 of FIG. 19.

[0446] Such dimensioning of the query processing engine makes it possible to use optimal algorithms at all stages while avoiding the need to swap intermediary data between RAM 103 and hard drive 105.

[0447] A further acceleration is achieved when parallel processing is used. The query processing is distributed between several processors, one for each virtual flat file block.

[0448] A possible architecture of the parallel query processing engine is illustrated in FIG. 72, in the particular case where all blocks have the same size jmax. A number M of matching units 700 are connected to a query server 701 through a communication network 702. Each matching unit 700 may be a processor system of the type shown in FIG. 18. It has a storage device 703 such as a hard drive for storing the thesaurus sections associated with the block. If a link table of the type shown in FIG. 9 is used, it is partitioned into blocks in the same manner as the virtual flat file, and each block is stored in the corresponding matching unit. The server 701 provides the man-machine interface. It translates the query criteria of the SQL WHERE clause into trees of the type shown in FIG. 37, which are provided to the M matching units 700 along with a description of the desired output. Each of the units 700 does its part of the job according to steps 191-193 of FIG. 36 and returns its response to the server 701. The latter compiles the results from the different matching units to provide the overall response to the user. In order to perform the analysis of step 191, each matching unit 700 uses its thesaurus sections.

[0449] Alternatively, the analysis of the query criteria could be executed centrally by the server 701 by means of global thesauruses, each global thesaurus being common to all the (macro)words and having M columns for containing pointers to identifier sub-lists in the M storage units 703. At the end of the analysis stage, the relevant pointers are addressed to the matching units 700 for their execution of steps 192-193.

[0450] An update server 704, which may be the same machine as the query server 701, is also connected to the network 702 to create and maintain the VDG's relating to the different blocks. It monitors the changes made in the data tables of the RDBMS and routes thesaurus update commands to the units 700 in order to make the necessary changes in the thesaurus sections.

[0451] The above-described parallel system is readily extended when the number of data graphs becomes close to the current maximum (M × jmax in the illustration of FIG. 72). This requires the addition of a further matching unit to deal with a new virtual flat file block, whose size may be the same as or different from the previous blocks, and a reconfiguration of the routing and result compilation functions in the servers 701, 704. The reconfiguration is completely transparent to the previously existing matching units. Therefore, increasing the system capacity can be done at a minimum cost. It does not even require shutting down the system.

1. A method of encoding integer lists in a computer system, comprising the steps of: dividing a range covering integers of an input list into subsets according to a predetermined pattern; and producing coding data including, for each subset containing at least one integer of the input list, data representing the position of said subset in the pattern, and data representing the position of each integer of the input list within said subset.
2. A method according to claim 1, wherein the data representing the position of each integer of the input list within a subset consist of a bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to the input list.
3. The method of claim 2, wherein the position of each subset in the pattern is represented by an integer rank which is included in the coding data, in association with the corresponding bitmap segment, if said subset contains at least one integer of the input list.
4. A method according to claim 3, wherein a coding data container comprising records having respective addresses is provided for storing together coding data produced from a plurality of integer lists, wherein each record of the coding data container includes a first field for storing an integer rank related to the pattern, a second field for storing an address value and a third field for storing a bitmap segment, and wherein the encoding of a non-empty input list comprises the steps of: a/ selecting an available record of the coding data container; b/ selecting a subset containing at least one integer of the input list to which no record has been allocated; c/ allocating the selected record to the selected subset; d/ storing the rank and the bitmap segment of the coding data produced for the selected subset in the first and third fields of the selected record, respectively; e/ if every subset containing at least one integer of the input list has a record allocated thereto, storing an end value in the second field of the selected record; and f/ if at least one subset containing at least one integer of the input list has no record allocated thereto, storing the address of an available record of the coding data container in the second field of the selected record, selecting said available record and repeating from step b/.
5. A method according to claim 4, wherein the coding data container has a first file comprising the first and second fields of the records and a second file comprising the third fields of the records, the first and second files being accessible separately.
6. A method according to claim 4, further comprising the step of grouping the records stored in the data container, so that the records allocated to the subsets for any encoded integer list have contiguous addresses.
7. A method according to claim 1, wherein the coding data produced from one integer list are stored in at least one file allocated to said one integer list.
8. A method according to claim 7, wherein the coding data are stored in first and second files having a common addressing, whereby for each subset containing at least one integer of the input list, the data representing the position of said subset in the pattern are stored in the first file and the data representing the position of each integer of the input list within said subset are stored at a corresponding address in the second file.
9. A method according to claim 1, wherein the subsets are consecutive intervals consisting of the same number of integers.
10. A method according to claim 9, wherein said number of integers is a whole power of 2.
11. A method of encoding integer lists in a computer system, comprising n successive coding layers, n being a number at least equal to 1, wherein each coding layer comprises the steps of: dividing a range covering integers of an input list of said layer into subsets according to a predetermined pattern; producing coding data including, for each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last coding layer, data representing the position of said subset in the pattern; if said layer is not the last coding layer, forming a further integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list, and providing said further integer list as an input list of the next layer.
12. A method according to claim 11, wherein, in the pattern of each layer, the subsets are consecutive intervals consisting of the same number of integers.
13. A method according to claim 12, wherein said number of integers is a whole power of 2 for each layer.
14. A method according to claim 11, wherein the coding data produced for each layer are stored in first and second files having a common addressing, whereby for each subset containing at least one integer of the input list of said layer, the data representing the position of said subset in the pattern are stored in the first file and the data representing the position of each integer of the input list within said subset are stored at a corresponding address in the second file.
15. A method according to claim 11, wherein the coding data produced from one integer list input in the first layer are stored in at least one file allocated to said one integer list.
16. A method according to claim 11, wherein the coding data produced from one integer list input in the first layer are stored as at least one record chain in a data container allocated to a plurality of integer lists.
17. A method according to claim 16, further comprising the step of grouping the records of the data container so that the records of each chain have contiguous addresses.
18. A method according to claim 11, wherein n≧2 and layer k data containers each having a plurality of records are provided in a computer memory for 1≦k≦n, each record of a layer k data container being associated with a layer k integer rank representing the position of a subset in the layer k pattern, and wherein each record of a layer k data container associated with a layer k rank representing the position of a subset in the layer k pattern has a first field for containing data for retrieving the position within said subset of any integer of a layer k input list relating to a layer 1 input list, whereby a combination of said layer k rank with any position retrievable from the data contained in said first field determines a layer k−1 rank with which a respective record of the layer k−1 data container is associated if k>1, and an integer of said layer 1 input list if k=1.
19. A method according to claim 18, wherein each record of the layer n data container associated with a layer n rank further has a second field for containing said layer n rank.
20. A method according to claim 18, wherein, for 1≦k≦n, said data contained in the first field of a record of the layer k data container for retrieving the position of any integer of a layer k input list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer belongs to said layer k input list.
21. A method according to claim 20, wherein, for 1≦k≦n, each record of the layer k data container associated with a layer k rank further has a second field for containing said layer k rank.
22. A method according to claim 21, wherein each data container comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.
23. A method according to claim 18, wherein, for 1≦k≦n, each record of the layer k data container further has a second field for containing a number representing the position of an integer of a layer k+1 input list within a subset of the layer k+1 pattern, and wherein, for 1<k≦n, said data contained in the first field of a record of the layer k data container associated with a layer k rank for retrieving the position of any integer of a layer k input list within a subset of the layer k pattern comprise a pointer to at least one record of the layer k−1 data container in which the second field contains a number representing the position of an integer of said layer k input list within said subset of the layer k pattern, whereby said record of the layer k−1 data container is associated with the layer k−1 rank determined by the combination of said layer k rank with the position represented by said number.
24. A method according to claim 23, wherein said data contained in the first field of a record of the layer 1 data container for retrieving the position of any integer of a layer 1 input list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer belongs to said layer 1 input list.
25. A method according to claim 23, wherein each layer k data container for 1≦k<n comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.
26. A method according to claim 18, wherein, for 1≦k≦n, each record of the layer k data container further has a next address field, whereby record chains are defined in the layer k data container by means of the next address fields, and wherein at least some of the layer 1 input lists are respectively associated with record chains in the layer n data container, whereby the coding data for layer n relating to one of said layer 1 input lists are stored in or retrievable from the record chain associated therewith in the layer n data container.
27. A method according to claim 26, wherein, for 1≦k<n, said layer 1 input lists are respectively associated with record chains in the layer k data container, whereby the coding data relating to one of said layer 1 input lists for layer k are stored in or retrievable from the record chain associated therewith in the layer k data container.
28. A method according to claim 26, wherein, for 1<k≦n, each record of the layer k data container further has a head address field for pointing to an address of a first record of a respective chain in the layer k−1 data container.
29. A method according to claim 26, wherein each layer k data container for 1≦k≦n comprises at least two files where the first fields and the next address fields of the records of said data container are respectively stored, said files being accessible separately.
30. A method according to claim 26, further comprising the step of grouping the records of the data container for each coding layer, so that the records of each chain have contiguous addresses.
31. A computerized method of combining a plurality of first integer lists into a second integer list, wherein at least one of the first integer lists is represented by stored coding data provided by a coding scheme comprising n successive coding layers, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said first integer list being the input list of the first layer, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer, wherein the stored coding data representing a first integer list comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer, the method comprising the steps of: defining a combination of intermediary lists each corresponding to at least one of the first integer lists; for k decreasing from n to 1, computing a layer k result list by combining a plurality of layer k intermediary lists in accordance with said combination; and producing the second integer list as the layer 1 result list, and wherein, for any intermediary list corresponding to at least one first integer list represented by stored coding data, the layer n intermediary list is determined from said stored coding data as consisting of the integers of any layer n input list associated with said at least one first integer list in the coding scheme and, if n>1, each layer k intermediary list for k<n is determined from said stored coding data and the layer k+1 result list as consisting of any integer of a layer k input list associated with said at least one first integer list in the coding scheme which belongs to a layer k subset whose position is represented in the layer k+1 result list.
32. A method according to claim 31, wherein, in the pattern of each layer, the subsets are consecutive intervals consisting of the same number of integers.
33. A method according to claim 32, wherein said number of integers is a whole power of 2 for each layer.
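By way of a non-limiting illustration of claims 32 and 33, when every subset is an interval of 2^bits consecutive integers, the position of the subset in the pattern and the position of an integer within that subset can be obtained with a shift and a mask. The function names and the `bits` parameter below are hypothetical and are not taken from the claims.

```python
def split_rank_offset(x: int, bits: int) -> tuple[int, int]:
    """Split x into (rank of its interval, position inside the interval).

    Assumes intervals of 2**bits consecutive integers starting at 0.
    """
    return x >> bits, x & ((1 << bits) - 1)


def join_rank_offset(rank: int, offset: int, bits: int) -> int:
    """Inverse of split_rank_offset: rebuild the integer from rank and position."""
    return (rank << bits) | offset
```

With intervals of 8 integers (`bits = 3`), the integer 37 falls in the interval of rank 4 at position 5.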
34. A method according to claim 31, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for the coding layer n define a layer n bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the data representing the position of said subset in the layer n pattern comprise a layer n integer rank associated with said layer n bitmap segment, and wherein the layer n intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data is determined in a procedure comprising: initializing a layer n bitmap vector with logical zeroes; obtaining the layer n ranks and associated bitmap segments from said stored coding data; and for each of said layer n ranks, superimposing the layer n bitmap segment associated therewith onto a segment of said layer n bitmap vector having a position determined by said layer n rank, the superimposition being performed according to a bitwise Boolean OR operation, said layer n intermediary list corresponding to the resulting layer n bitmap vector.
35. A method according to claim 34, wherein n>1 and in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the coding data further comprise a layer k integer rank associated with said layer k bitmap segment to represent the position of said subset in the layer k pattern, and wherein, for k<n, the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data is determined in a procedure comprising: initializing a layer k bitmap vector with logical zeroes; obtaining the layer k ranks from said stored coding data; and selecting any obtained layer k rank belonging to the layer k+1 result list and superimposing the associated layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by the selected layer k rank, the superimposition being performed according to a bitwise Boolean OR operation, said layer k intermediary list corresponding to the resulting layer k bitmap vector.
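The procedure of claims 31, 34 and 35 may be sketched as follows for the particular case where the combination is an intersection (Boolean AND) of two coded lists. This is an illustrative sketch under stated assumptions, not the claimed implementation: each layer is assumed to be held in memory as a dictionary mapping a rank to its bitmap segment, Python integers stand in for bitmap vectors, and all names (`intermediary`, `intersect`, `bits_per_layer`, `coded_a`, `coded_b`) are hypothetical.

```python
def bitmap_to_list(vector: int) -> list[int]:
    """Return the positions of the bits set to 1 in a bitmap vector."""
    out, pos = [], 0
    while vector:
        if vector & 1:
            out.append(pos)
        vector >>= 1
        pos += 1
    return out


def intermediary(segments: dict[int, int], bits: int, allowed=None) -> int:
    """Build a layer bitmap vector by OR-ing segments at rank-determined positions.

    If `allowed` is given (the layer k+1 result list), only ranks that
    belong to it are selected, as in claim 35.
    """
    width = 1 << bits
    vector = 0  # initialized with logical zeroes
    for rank, segment in segments.items():
        if allowed is None or rank in allowed:
            vector |= segment << (rank * width)
    return vector


def intersect(coded_a, coded_b, bits_per_layer) -> list[int]:
    """Intersect two coded lists, working from layer n down to layer 1."""
    allowed = None
    result = 0
    for k in range(len(bits_per_layer) - 1, -1, -1):  # index 0 is layer 1
        bits = bits_per_layer[k]
        result = (intermediary(coded_a[k], bits, allowed)
                  & intermediary(coded_b[k], bits, allowed))
        allowed = set(bitmap_to_list(result))  # layer k result list
    return bitmap_to_list(result)


# Assumed sample data: two layers with intervals of 8 integers each.
coded_a = [{0: 0b101000, 2: 0b100}, {0: 0b101}]          # encodes [3, 5, 18]
coded_b = [{0: 0b100000, 2: 0b100, 5: 0b1}, {0: 0b100101}]  # encodes [5, 18, 40]
```

In this sample, the layer 1 segment of rank 5 in `coded_b` is never superimposed, because rank 5 is absent from the layer 2 result list; `intersect(coded_a, coded_b, [3, 3])` yields `[5, 18]`.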
36. A method according to claim 35, wherein, for 1≦k<n, the layer k ranks and the layer k bitmap segments associated therewith are stored at corresponding addresses in distinct first and second files, and said procedure for determining the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data comprises: providing a rank table in a RAM memory, having records associated with the addresses in said first and second files; filling the rank table by writing any selected layer k rank into the rank table record associated with the address of the selected layer k rank in said first file; and for any record of the filled rank table containing a layer k rank and associated with an address in the second file, reading the associated layer k bitmap segment at said address in the second file and superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.
37. A method according to claim 34, wherein n>1 and for any coding layer k such that 1<k≦n, a layer k′ filtering list is determined for k≦k′≦n, said layer k′ filtering list being the layer k′ input list obtained by providing the layer k result list as an input list in layer k of the coding scheme, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while a layer k integer rank associated with said layer k bitmap segment represents the position of said subset in the layer k pattern, and wherein, for k<n, the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data is determined in a procedure comprising: /a/ initializing a layer k bitmap vector with logical zeroes; /b/ selecting the layer n ranks obtained from said stored coding data, and setting k′=n; /c/ for each selected layer k′ rank: /c1/ if the selected layer k′ rank represents the position in the layer k′ pattern of a subset which includes at least one integer of the layer k′ filtering list, obtaining the layer k′ bitmap segment with which the selected layer k′ rank is associated; /c2/ for any integer of the layer k′ filtering list whose position within said subset is represented in said layer k′ bitmap segment, selecting a respective layer k′−1 rank determined from the selected layer k′ rank and said position represented in said layer k′ bitmap segment; /c3/ if k′>k+1, executing step /c/ with k′ decremented by one unit; and /c4/ if k′−1=k, obtaining any layer k bitmap segment with which a selected layer k′−1 rank is associated, and superimposing said layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said selected layer k′−1 rank, the superimposition being performed according to a bitwise Boolean OR operation, said layer k intermediary list corresponding to the resulting layer k bitmap vector.
38. A method according to claim 37, wherein, for 1≦k<n, the layer k bitmap segments are stored in at least one layer k file at addresses respectively corresponding to the layer k ranks associated therewith, and said procedure for determining the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data comprises: providing a rank table in a RAM memory, having records associated with the addresses in said layer k file; filling the rank table by writing any selected layer k rank into the rank table record associated with the address corresponding to the selected layer k rank; and for any record of the filled rank table containing a layer k rank and associated with an address in said layer k file, reading the associated layer k bitmap segment at said address and superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.
39. A method according to claim 31, wherein a coding data container comprising records having respective addresses is provided for each coding layer k≦n, for storing together layer k coding data of a plurality of said first integer lists, and wherein each record of the coding data container for each layer includes a first field for storing a rank related to the pattern of said layer, a second field for storing an address value and a third field for storing a bitmap segment, whereby said address value either points to another record of the data container where further layer k coding data relating to the same first integer list are stored or designates an end of coding data.
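The record layout of claim 39 may be pictured, purely for illustration, as a chained record structure in which the address field leads from one record of a list to the next. In this sketch the container is assumed to be a Python list indexed by address, and the sentinel `END`, the class `Record` and the function `read_chain` are hypothetical names that do not appear in the claims.

```python
from dataclasses import dataclass

END = -1  # assumed sentinel designating the end of coding data


@dataclass
class Record:
    rank: int        # first field: rank related to the layer pattern
    next_addr: int   # second field: address of the next record, or END
    segment: int     # third field: bitmap segment


def read_chain(container: list[Record], head: int) -> list[tuple[int, int]]:
    """Follow a record chain from `head` and collect (rank, segment) pairs."""
    pairs = []
    addr = head
    while addr != END:
        record = container[addr]
        pairs.append((record.rank, record.segment))
        addr = record.next_addr
    return pairs
```

A single container can thus hold the coding data of several integer lists, each list being identified by the address of the first record of its chain.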
40. A method according to claim 39, wherein the records stored in the data container for each coding layer k are so grouped that the records where the layer k coding data of any first integer list are stored have contiguous addresses.
41. A method according to claim 31, wherein the coding data are stored in at least one file allocated to one first integer list.
42. A method according to claim 41, wherein the coding data of each layer are stored in first and second files having a common addressing, whereby for each subset containing at least one integer of the input list of said layer, the data representing the position of said subset in the pattern are stored in the first file and the data representing the position of each integer of the input list within said subset are stored at a corresponding address in the second file.
43. A method according to claim 31, wherein the intermediary lists include at least one preset list, said preset list consisting of one of the first integer lists for which the layer k input lists, according to the coding scheme, are determined in advance for 1≦k≦n, said layer k input lists being the respective layer k intermediary lists corresponding to said preset list.
44. A computer program product for encoding integer lists in a computer system, comprising instructions for encoding the integer lists in accordance with n successive coding layers, n being a number at least equal to 1, wherein a range covering integers of input lists of each layer is divided into subsets according to a predetermined pattern, wherein an integer list to be encoded is the input list of the first layer, the computer program product comprising, for each coding layer: instructions for producing coding data including, for each subset containing at least one integer of the input list of said layer, data representing the position of each integer of the input list within said subset and, at least if said layer is the last coding layer, data representing the position of said subset in the pattern; if said layer is not the last coding layer, instructions for forming a further integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list, and for providing said further integer list as an input list of the next layer.
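A minimal sketch of the layered encoding of claim 44 is given below, assuming the power-of-2 interval pattern of claims 45 and 46 and bitmap segments for the positions within each subset. The function name `encode_layers` and the parameter `bits_per_layer` are hypothetical. As a simplifying assumption, the sketch keeps the ranks at every layer, whereas the claim requires them only for the last layer.

```python
def encode_layers(values, bits_per_layer):
    """Encode an integer list into one {rank: bitmap segment} dict per layer.

    bits_per_layer[k] gives log2 of the interval size used at layer k+1.
    """
    layers = []
    current = sorted(set(values))  # input list of the first layer
    for bits in bits_per_layer:
        segments = {}
        for x in current:
            rank = x >> bits                  # position of the subset in the pattern
            offset = x & ((1 << bits) - 1)    # position of x within the subset
            segments[rank] = segments.get(rank, 0) | (1 << offset)
        layers.append(segments)
        current = sorted(segments)  # ranks of non-empty subsets feed the next layer
    return layers
```

For example, encoding the list 3, 5, 18 with two layers of 8-integer intervals gives two layer 1 segments (ranks 0 and 2) and a single layer 2 segment (rank 0) whose bits 0 and 2 mark those two ranks.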
45. A computer program product according to claim 44, wherein, in the pattern of each layer, the subsets are consecutive intervals consisting of the same number of integers.
46. A computer program product according to claim 45, wherein said number of integers is a whole power of 2 for each layer.
47. A computer program product according to claim 44, further comprising instructions for storing the coding data produced for each layer in first and second files having a common addressing, whereby for each subset containing at least one integer of the input list of said layer, the data representing the position of said subset in the pattern are stored in the first file and the data representing the position of each integer of the input list within said subset are stored at a corresponding address in the second file.
48. A computer program product according to claim 44, further comprising instructions for storing the coding data produced from one integer list input in the first layer in at least one file allocated to said one integer list.
49. A computer program product according to claim 44, further comprising instructions for storing the coding data produced from one integer list input in the first layer as at least one record chain in a data container allocated to a plurality of integer lists.
50. A computer program product according to claim 49, further comprising instructions for grouping the records of the data container so that the records of each chain have contiguous addresses.
51. A computer program product according to claim 44, wherein n≧2 and layer k data containers each having a plurality of records are provided in a computer memory for 1≦k≦n, each record of a layer k data container being associated with a layer k integer rank representing the position of a subset in the layer k pattern, and wherein each record of a layer k data container associated with a layer k rank representing the position of a subset in the layer k pattern has a first field for containing data for retrieving the position within said subset of any integer of a layer k input list relating to a layer 1 input list, whereby a combination of said layer k rank with any position retrievable from the data contained in said first field determines a layer k−1 rank with which a respective record of the layer k−1 data container is associated if k>1, and an integer of said layer 1 input list if k=1.
52. A computer program product according to claim 51, wherein each record of the layer n data container associated with a layer n rank further has a second field for containing said layer n rank.
53. A computer program product according to claim 51, wherein, for 1≦k≦n, said data contained in the first field of a record of the layer k data container for retrieving the position of any integer of a layer k input list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer belongs to said layer k input list.
54. A computer program product according to claim 53, wherein, for 1≦k≦n, each record of the layer k data container associated with a layer k rank further has a second field for containing said layer k rank.
55. A computer program product according to claim 54, wherein each data container comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.
56. A computer program product according to claim 51, wherein, for 1≦k<n, each record of the layer k data container further has a second field for containing a number representing the position of an integer of a layer k+1 input list within a subset of the layer k+1 pattern, and wherein, for 1<k≦n, said data contained in the first field of a record of the layer k data container associated with a layer k rank for retrieving the position of any integer of a layer k input list within a subset of the layer k pattern comprise a pointer to at least one record of the layer k−1 data container in which the second field contains a number representing the position of an integer of said layer k input list within said subset of the layer k pattern, whereby said record of the layer k−1 data container is associated with the layer k−1 rank determined by the combination of said layer k rank with the position represented by said number.
57. A computer program product according to claim 56, wherein said data contained in the first field of a record of the layer 1 data container for retrieving the position of any integer of a layer 1 input list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer belongs to said layer 1 input list.
58. A computer program product according to claim 56, wherein each layer k data container for 1≦k<n comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.
59. A computer program product according to claim 51, wherein, for 1≦k≦n, each record of the layer k data container further has a next address field, whereby record chains are defined in the layer k data container by means of the next address fields, and wherein at least some of the layer 1 input lists are respectively associated with record chains in the layer n data container, whereby the coding data for layer n relating to one of said layer 1 input lists are stored in or retrievable from the record chain associated therewith in the layer n data container.
60. A computer program product according to claim 59, wherein, for 1≦k<n, said layer 1 input lists are respectively associated with record chains in the layer k data container, whereby the coding data relating to one of said layer 1 input lists for layer k are stored in or retrievable from the record chain associated therewith in the layer k data container.
61. A computer program product according to claim 59, wherein, for 1<k≦n, each record of the layer k data container further has a head address field for pointing to an address of a first record of a respective chain in the layer k−1 data container.
62. A computer program product according to claim 59, wherein each layer k data container for 1≦k≦n comprises at least two files where the first fields and the next address fields of the records of said data container are respectively stored, said files being accessible separately.
63. A computer program product according to claim 59, further comprising instructions for grouping the records of the data container for each coding layer, so that the records of each chain have contiguous addresses.
64. A computer program product for combining a plurality of first integer lists into a second integer list, wherein at least one of the first integer lists is represented by stored coding data provided by a coding scheme comprising n successive coding layers, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said first integer list being the input list of the first layer, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer, wherein the stored coding data representing a first integer list comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer, the computer program product comprising: instructions for defining a combination of intermediary lists each corresponding to at least one of the first integer lists; for k decreasing from n to 1, instructions for computing a layer k result list by combining a plurality of layer k intermediary lists in accordance with said combination; and instructions for producing the second integer list as the layer 1 result list, whereby, for any intermediary list corresponding to at least one first integer list represented by stored coding data, the layer n intermediary list is determined from said stored coding data as consisting of the integers of any layer n input list associated with said at least one first integer list in the coding scheme and, if n>1, each layer k intermediary list for k<n is determined from said stored coding data and the layer k+1 result list as consisting of any integer of a layer k input list associated with said at least one first integer list in the coding scheme which belongs to a layer k subset whose position is represented in the layer k+1 result list.
65. A computer program product according to claim 64, wherein, in the pattern of each layer, the subsets are consecutive intervals consisting of the same number of integers.
66. A computer program product according to claim 65, wherein said number of integers is a whole power of 2 for each layer.
67. A computer program product according to claim 64, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for the coding layer n define a layer n bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the data representing the position of said subset in the layer n pattern comprise a layer n integer rank associated with said layer n bitmap segment, the computer program product comprising instructions for determining the layer n intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data, said instructions for determining the layer n intermediary list comprising: instructions for initializing a layer n bitmap vector with logical zeroes; instructions for obtaining the layer n ranks and associated bitmap segments from said stored coding data; and for each of said layer n ranks, instructions for superimposing the layer n bitmap segment associated therewith onto a segment of said layer n bitmap vector having a position determined by said layer n rank, the superimposition being performed according to a bitwise Boolean OR operation, said layer n intermediary list corresponding to the resulting layer n bitmap vector.
68. A computer program product according to claim 67, wherein n>1 and in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the coding data further comprise a layer k integer rank associated with said layer k bitmap segment to represent the position of said subset in the layer k pattern, the computer program product comprising, for k<n, instructions for determining the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data, said instructions for determining the layer k intermediary list comprising: instructions for initializing a layer k bitmap vector with logical zeroes; instructions for obtaining the layer k ranks from said stored coding data; and instructions for selecting any obtained layer k rank belonging to the layer k+1 result list and for superimposing the associated layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by the selected layer k rank, the superimposition being performed according to a bitwise Boolean OR operation, said layer k intermediary list corresponding to the resulting layer k bitmap vector.
69. A computer program product according to claim 68, wherein, for 1≦k<n, the layer k ranks and the layer k bitmap segments associated therewith are stored at corresponding addresses in distinct first and second files, and said instructions for determining the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data comprises: instructions for providing a rank table in a RAM memory, having records associated with the addresses in said first and second files; instructions for filling the rank table by writing any selected layer k rank into the rank table record associated with the address of the selected layer k rank in said first file; and for any record of the filled rank table containing a layer k rank and associated with an address in the second file, instructions for reading the associated layer k bitmap segment at said address in the second file and for superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.
70. A computer program product according to claim 67, further comprising, for any coding layer k such that 1<k≦n, instructions for determining a layer k′ filtering list for k≦k′≦n, said layer k′ filtering list being the layer k′ input list obtained by providing the layer k result list as an input list in layer k of the coding scheme, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while a layer k integer rank associated with said layer k bitmap segment represents the position of said subset in the layer k pattern, the computer program product comprising, for k<n, instructions for determining the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data, said instructions for determining the layer k intermediary list comprising: /a/ instructions for initializing a layer k bitmap vector with logical zeroes; /b/ instructions for selecting the layer n ranks obtained from said stored coding data, and for setting k′=n; /c/ for each selected layer k′ rank: /c1/ if the selected layer k′ rank represents the position in the layer k′ pattern of a subset which includes at least one integer of the layer k′ filtering list, instructions for obtaining the layer k′ bitmap segment with which the selected layer k′ rank is associated; /c2/ for any integer of the layer k′ filtering list whose position within said subset is represented in said layer k′ bitmap segment, instructions for selecting a respective layer k′−1 rank determined from the selected layer k′ rank and said position represented in said layer k′ bitmap segment; /c3/ if k′>k+1, instructions for executing the instructions /c/ with k′ decremented by one unit; and /c4/ if k′−1=k, instructions for obtaining any layer k bitmap segment with which a selected layer k′−1 rank is associated, and for superimposing said layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said selected layer k′−1 rank, the superimposition being performed according to a bitwise Boolean OR operation, said layer k intermediary list corresponding to the resulting layer k bitmap vector.
71. A computer program product according to claim 70, further comprising, for 1≦k<n, instructions for storing the layer k bitmap segments in at least one layer k file at addresses respectively corresponding to the layer k ranks associated therewith, and said instructions for determining the layer k intermediary list for an intermediary list corresponding to at least one first integer list represented by stored coding data comprises: instructions for providing a rank table in a RAM memory, having records associated with the addresses in said layer k file; instructions for filling the rank table by writing any selected layer k rank into the rank table record associated with the address corresponding to the selected layer k rank; and for any record of the filled rank table containing a layer k rank and associated with an address in said layer k file, instructions for reading the associated layer k bitmap segment at said address and for superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.
72. A computer program product according to claim 64, wherein the intermediary lists include at least one preset list, said preset list consisting of one of the first integer lists for which the layer k input lists, according to the coding scheme, are determined in advance for 1≦k≦n, said layer k input lists being the respective layer k intermediary lists corresponding to said preset list.